We review the current state of data mining and machine learning in astronomy. Data 
mining can have a somewhat mixed connotation from the point of view of a researcher in 
this field. If used correctly, it can be a powerful approach, holding the potential to fully 
exploit the exponentially increasing amount of available data, promising great scientific 
advance. However, if misused, it can be little more than the black-box application of com- 
plex computing algorithms that may give little physical insight, and provide questionable 
results. Here, we give an overview of the entire data mining process, from data collection 
through to the interpretation of results. We cover common machine learning algorithms, 
such as artificial neural networks and support vector machines, applications from a broad 
range of astronomy, emphasizing those where data mining techniques directly resulted 
in improved science, and important current and future directions, including probability 
density functions, parallel algorithms, petascale computing, and the time domain. We 
conclude that, so long as one carefully selects an appropriate algorithm, and is guided 
by the astronomical problem at hand, data mining can be very much the powerful tool, 
and not the questionable black box. 




1. Introduction 

In its broadest sense, data mining is simply the act of turning raw data from an ob- 
servation into useful information. This information can be interpreted by hypothesis 
or theory, and used to make further predictions. This scientific method, where useful 
statements are made about the world, has been widely employed to great effect in 
the West since the Renaissance, and even earlier in other parts of the world. What 
has changed in the past few decades is the exponential rise in available computing 
power, and, as a related consequence, the enormous quantities of observed data, 
primarily in digital form. The exponential rise in the amount of available data is 
now creating, in addition to the natural world, a digital world, in which extracting 
new and useful information from the data already taken and archived is becoming a 
major endeavor in itself. This action of knowledge discovery in databases (KDD) is 
what is most commonly meant by the phrase data mining, and it forms the basis 
for our review. 

Astronomy has been among the first scientific disciplines to experience this flood 
of data. The emergence of data mining within this and other subjects has been 
described as the fourth paradigm. The first two paradigms are the well-known 
pair of theory and observation, while the third is another relatively recent addition, 
computer simulation. The sheer volume of data not only necessitates this new 
paradigmatic approach, but the approach must be, to a large extent, automated. 
In more formal terms, we wish to leverage a computational machine to find patterns 
in digital data, and translate these patterns into useful information, hence 
machine learning. This learning must be returned in a useful manner to a human 
investigator, which hopefully results in human learning. 

It is perhaps not entirely unfair to say, however, that scientists in general do not 
yet appreciate the full potential of this fourth paradigm. There are good reasons 
for this of course: scientists are generally not experts in databases, or cutting-edge 
branches of statistics, or computer hardware, and so forth. What we hope to do in 
this review, primarily for the data mining skeptic, is to shed light on why this is a 
useful approach. To accomplish this goal, we emphasize algorithms that have been, 
or could currently be, usefully employed, and the actual scientific results they have 
enabled. We also hope to give an interesting and fairly comprehensive overview to 
those who do already appreciate this approach, and perhaps provide inspiration for 
exciting new ideas and applications. However, despite referring to data mining as a 
whole new paradigm, we try to emphasize that it is, like theory, observation, and 
simulation, only a part of the broader scientific process, and should be viewed and 
utilized as such. The algorithms described are tools that, when applied correctly, 
have vast potential for the creation of useful scientific results. But, given that data 
mining is only part of the process, it is, of course, not the answer to everything, and 
we therefore enumerate some of the limitations of this new paradigm. 

We start in §1.1 with a summary of some of the advantages of this approach. 
In §2 we summarize the process from the input of raw data to the visualization 
of results. This is followed in §3 by the actual application of data mining tools in 
astronomy. §2 is arranged algorithmically, and §3 is arranged astrophysically. It is 
likely that the expert in astronomy or data mining, respectively, could infer much 
of §3 from §2 and vice versa. But it is unlikely (we hope) that the combination 
of the two sections does not have new ideas or insights to offer to either audience. 
Following these two sections, in §4 we combine the lessons learned to discuss the 
future of data mining in astronomy, pointing out likely near-term future directions 
in both the data mining process and its physical application. We conclude with a 
summary of the main points in §5. 

1.1. Why Data Mining? 

Of course, what astronomers care about is not a fashionable new computational 
method for ever more complex data analysis, but the science. A fancy new data 
mining system is not worth much if all it tells you is what you could have gained by 
the judicious application of existing tools and a little physical insight. We therefore 
summarize some of the advantages of this approach: 

• Getting anything at all: upcoming datasets will be almost overwhelmingly large. 
When one is faced with Petabytes of data, a rigorous, automated approach that 
intelligently extracts pertinent scientific information will be the only one that is 
tractable. 

• Simplicity: despite the apparent plethora of methods, straightforward applications 
of very well-known and well-tested data mining algorithms can quickly produce 
a useful result. These methods can generate a model appropriate to the com- 
plexity of an input dataset, including nonlinearities, implicit prior information, 
systematic biases, or unexpected patterns. With this approach, a priori data sam- 
pling of the type exemplified by elaborate color cuts, is not necessary. For many 
algorithms, new data can be trivially incorporated as they become available. 

• Prior information: this can be either fully incorporated, or the data can be allowed 
to completely 'speak for themselves'. For example, an unsupervised clustering 
algorithm can highlight new classes of objects within a dataset that might be 
missed if a prior set of classifications were imposed. 

• Pattern recognition: an appropriate algorithm can highlight patterns in a dataset 
that might not otherwise be noticed by a human investigator, perhaps due to the 
high dimensionality. Similarly, rare or unusual objects can be highlighted. 

• Complementary approach: there are numerous examples where the data 
mining approach demonstrably exceeds more traditional methods in terms of 
scientific return. Even when the approach does not produce a substantial 
improvement, it still acts as an important complementary method of analyzing data, 
because different approaches to an overall problem help to mitigate systematic 
errors in any one approach. 

2. Overview of Data Mining and Machine Learning Methods 

In this section, we review the data mining process. Specifically, as described in 
§1, this data mining review focuses on knowledge discovery in databases (KDD), 
although our definition of a 'database' is somewhat broad, essentially being any 
machine-readable astronomical data. As a result, this section is arranged 
algorithmically. To avoid overlap with §3 on the astronomical uses, we defer most of the 
application examples to that section. Nevertheless, all algorithms we describe have 
been, or are of sufficient maturity that they could immediately be, applied to 
astronomical data. The reader who is expert in astronomy but not in data mining is 
advised to read this section to gain the full benefit from §3. As in any specialized 
subject, a certain level of jargon is necessary for clarity of expression. Terms likely 
to be unfamiliar to astronomers not versed in data mining are generally explained 
as they are introduced, but for additional background we note that there are other 
useful reviews of the data mining field. Another recent overview of data mining 
in astronomy by Borne has also been published. 



2.1. Data Collection 

The process of data collection encompasses all of the steps required to obtain the 
desired data in a digital format. Methods of data collection include acquiring and 
archiving new observations, querying existing databases according to the science 
problem at hand, and performing as necessary any cross-matching or data combin- 
ing, a process generically described as data fusion. 

A common motivation for cross-matching is the use of multiwavelength data, 
i.e., data spanning more than one of the regions of the electromagnetic spectrum 
(gamma ray, X-ray, ultraviolet, optical, infrared, millimeter, and radio). A common 
method in the absence of a definitive identification for each object spanning the 
datasets is to use the object's position on the sky with some astrometric tolerance, 
typically a few arcseconds. Cross-matching can introduce many issues including 
ambiguous matches, variations of the point spread function (resolution of objects) 
within or between datasets, differing survey footprints, survey masks, and large 
amounts of processing time and data transfer requirements when cross-matching 
large datasets. 
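
As a minimal sketch of positional cross-matching, the following brute-force example pairs objects from two catalogs whose angular separation falls within a tolerance. The catalog contents, the function names, and the 2 arcsecond default are illustrative assumptions; a production cross-match of large surveys would use a spatial index rather than an all-pairs comparison.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine form); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(math.sqrt(h))) * 3600.0

def cross_match(cat_a, cat_b, tol_arcsec=2.0):
    """For each (ra, dec) in cat_a, list the indices in cat_b within the tolerance.

    An empty list means no counterpart; more than one entry is an ambiguous match.
    Brute force, O(len(cat_a) * len(cat_b)).
    """
    matches = []
    for ra_a, dec_a in cat_a:
        hits = [j for j, (ra_b, dec_b) in enumerate(cat_b)
                if angular_separation_arcsec(ra_a, dec_a, ra_b, dec_b) <= tol_arcsec]
        matches.append(hits)
    return matches

# Hypothetical positions (degrees) from two surveys.
optical = [(10.0, 20.0), (10.5, -5.0), (200.0, 45.0)]
infrared = [(10.0002, 20.0001), (10.0003, 19.9999), (10.5, -5.0), (150.0, 0.0)]
result = cross_match(optical, infrared)
```

Here the first optical source has two candidate counterparts, which illustrates the ambiguous-match issue noted above, and the third has none.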

A major objective of the Virtual Observatory (VO, §4.5) is to make the data 
collection process simpler and more tractable. Future VO webservices are planned 
that will perform several functions in this area, including cross-matches on large, 
widely distributed, heterogeneous data. 

Common astronomical data formats include FITS, a binary format, and plain 
ASCII, while an emerging format is VOTable. Commonly used formats from other 
areas of data mining, such as attribute relation file format (ARFF), are generally 
not widely used in astronomy. 



2.2. Preprocessing of Data 

Some data preprocessing may necessarily be part of the data collection process, for 
example, sample cuts in database queries. Preprocessing can be divided into steps 
that make the data to be read meaningful, and those that transform the data in 
some way as appropriate to a given algorithm. Data preprocessing is often problem-
dependent, and should be carefully applied because the results of many data mining 
algorithms can be significantly affected by the input data. A useful overview of data 
preprocessing is given by 



ARFF: http://weka.wiki.sourceforge.net/ARFF

Algorithms may require the object attributes, i.e., the values in the data fields 
describing the properties of each object, to be numerical or categorical, the latter 
being, e.g. 'star', or 'galaxy'. It is possible to transform numerical data to categorical 
and vice versa. 

A common categorical-to-numerical method is scalarization, in which different 
possible categorical attributes are given different numerical labels, for example, 
'star', 'galaxy', 'quasar' labeled as the vectors [1,0,0], [0,1,0], and [0,0,1], respectively. 
Note that for some algorithms, one should not label categories as, say, 1, 2 and 3, if 
the output of the algorithm is such that if it has confused an object between 1 and 3 
it labels the object as intermediate, in this case, 2. Here, 2 (galaxy) is certainly not 
an intermediate case between 1 (star) and 3 (quasar). One common algorithm in 
which such categorical but not ordered outputs could occur is a decision tree with 
multiple outputs. 
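
The vector labeling described above can be sketched as follows. The function name `one_hot` is an assumption made for illustration; the category list matches the example in the text.

```python
def one_hot(labels, categories):
    """Map each categorical label to a vector with a single 1 at its category index."""
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for label in labels:
        vector = [0] * len(categories)
        vector[index[label]] = 1
        vectors.append(vector)
    return vectors

categories = ["star", "galaxy", "quasar"]
encoded = one_hot(["galaxy", "star", "quasar"], categories)
```

Because every pair of these vectors is equally far apart, no category is implied to lie between two others, which is the property that the labels 1, 2, and 3 lack.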

Numerical data can be made categorical by transformations such as binning. The 
bins may be user-specified, or can be generated optimally from the data. Binning 
can create numerical issues, including comparing two floating point numbers that 
should be identical, objects on a bin edge, empty bins, values that cannot be binned 
such as NaN, or values not within the bin range. 
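
A minimal sketch of binning that makes these edge cases explicit is given below. The bin edges and magnitudes are invented for illustration, and the conventions chosen (half-open bins, with the uppermost edge included in the last bin) are one assumption among several reasonable ones.

```python
import math
from bisect import bisect_right

def bin_index(value, edges):
    """Return the index of the half-open bin [edges[i], edges[i+1]) containing value.

    Returns None for values that cannot be binned (None, NaN) or that lie
    outside the bin range. The top edge is included in the last bin.
    """
    if value is None or math.isnan(value):
        return None
    if value < edges[0] or value > edges[-1]:
        return None
    if value == edges[-1]:
        return len(edges) - 2
    return bisect_right(edges, value) - 1

edges = [14.0, 16.0, 18.0, 20.0, 22.0]                 # hypothetical magnitude bins
mags = [15.2, 16.0, 22.0, 23.5, float("nan"), 13.9]    # includes edge and bad cases
indices = [bin_index(m, edges) for m in mags]
counts = [indices.count(i) for i in range(len(edges) - 1)]
```

The value 16.0 sits on a bin edge and is assigned to the upper bin by the half-open convention, while the third bin is left empty and must be handled by any downstream step that divides by bin counts.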

Object attributes may need to be transformed. A common operation is the dif- 
ferencing of magnitudes to create colors. These transformations can introduce their 
own numerical issues, such as division by zero, or loss of accuracy. 

In general, data will contain one or more types of bad values, where the value is 
not correct. Examples include instances where the value has been set to something 
such as -9999 or NaN, the value appears correct but has been flagged as bad, or 
the value is not bad in a formatting sense but is clearly unphysical, perhaps a 
magnitude of a high value that could not have been detected by the instrument. 
They may need to be removed either by simply removing the object containing 
them, ignoring the bad value but using the remaining data, or interpolating a value 
using other information. Outliers may or may not be excluded, or may be excluded 
depending on their extremity. 
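
As a sketch of the simplest of these options, removal of objects containing bad values, the following example flags the three kinds of bad value listed above before differencing magnitudes to form a color. The sentinel value, the magnitude limit, and the band names `g` and `r` are assumptions chosen for illustration.

```python
import math

SENTINEL = -9999.0   # assumed placeholder used by the catalog for a bad value
MAG_LIMIT = 30.0     # assumed faint limit; anything fainter is treated as unphysical

def is_bad(mag):
    """True for missing, NaN, sentinel-valued, or unphysically faint magnitudes."""
    return mag is None or math.isnan(mag) or mag == SENTINEL or mag > MAG_LIMIT

def color(mag_a, mag_b):
    """Color as a magnitude difference, or None if either magnitude is bad."""
    if is_bad(mag_a) or is_bad(mag_b):
        return None
    return mag_a - mag_b

objects = [
    {"g": 18.5, "r": 18.0},
    {"g": -9999.0, "r": 19.0},        # sentinel value
    {"g": float("nan"), "r": 20.0},   # NaN
    {"g": 21.0, "r": 99.0},           # well-formed but unphysical
]
colors = [color(obj["g"], obj["r"]) for obj in objects]
clean = [c for c in colors if c is not None]
```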

Data may also contain missing values. These values may be genuinely missing, 
for example in a cross-matched dataset where an object is not detected in a given 
waveband, or is not in an overlapping region of sky. It is also possible that the data 
should be present, but are missing for either a known reason, such as a bad camera 
pixel, a cosmic ray hit, or a reason that is simply not known. Some algorithms 
cannot be given missing values, which will require either the removal of the object 
or interpolation of the value from the existing data. The advisability of interpolation 
is problem-dependent. 

As well as bad values, the data may contain values that are correct but are 
outside the desired range of analysis. The data may therefore need to be sampled. 
There may simply be a desired range, such as magnitude or position on the sky, 
or the data may contain values that are correct but are outliers. Outliers may 
be included, included depending on their extremity (e.g., n standard deviations), 
downweighted, or excluded. Alternatively, it may be more appropriate to generate 
a random subsample to produce a smaller dataset. 

Outside any normalization of the data prior to its use in the data mining 
algorithm, for example, calibration using standard sources, input or target attributes 
of the data will often be further normalized to improve the numerical conditioning 
of the algorithm. For example, if one axis of the n-dimensional space created by 
n input attributes encompasses a range that, numerically, is much larger than the 
other axes, it may dominate the results, or create conditions where very large and 
small numbers interact, causing loss of accuracy. Normalization can reduce this, 
and examples include linear transformations, like scaling by a given amount, 
scaling using the minimum and maximum values so that each attribute is in a given 
range such as 0-1, or scaling each attribute to have a mean of 0 and a standard 
deviation of 1. The latter example is known as standardization. A more sophisticated 
transformation with similar advantages is whitening, in which the values are not 
only scaled to a similar range, but correlations among the attributes are removed 
via transformation of their covariance matrix to the identity matrix. 
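
The two simplest normalizations, minimum-maximum scaling and standardization, can be sketched as below; whitening is not shown. The attribute values are invented, and the population (rather than sample) standard deviation is an assumption of this sketch.

```python
import math

def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly rescale values so that they span the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]   # constant attribute: no spread to rescale
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]

def standardize(values):
    """Rescale values to a mean of 0 and a (population) standard deviation of 1."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Two hypothetical attributes with very different numerical ranges.
redshift = [0.1, 0.2, 0.3, 0.4, 0.5]
flux = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]

scaled_flux = min_max_scale(flux)
std_flux = standardize(flux)
std_redshift = standardize(redshift)
```

After standardization the two attributes occupy the same numerical range, so neither dominates a distance calculation simply because of its units.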

2.3. Attribute Selection 

In general, a large number of attributes will be available for each object in a dataset, 
and not all will be required for the problem. Indeed, use of all attributes may in 
many cases worsen performance. This is a well-known problem, often called the curse 
of dimensionality. The large number of attributes results in a high-dimensional 
space with many low density environments or even empty voids. This makes it 
difficult to generalize from the data and produce useful new results. One therefore 
requires some form of dimension reduction, in which one wishes to retain as much 
of the information as possible, but in fewer attributes. As well as the curse of 
dimensionality, some algorithms work less well with noisy, irrelevant, or redundant 
attributes. An example of an irrelevant attribute might be position on the sky 
for a survey with a uniform mask, because the position would then contain no 
information, and highly redundant attributes might be a color in the same waveband 
measured in two apertures. 

The most trivial form of dimension reduction is simply to use one's judgement 
and select a subset of attributes. Depending on the problem this can work well. 
Nevertheless, one can usually take a more sophisticated and less subjective approach, 
such as principal component analysis (PCA). This is straightforward to 
implement, but is limited to linear relations. It gives, as the principal components, 
the eigenvectors of the covariance matrix of the input data, i.e., it picks out the 
directions which contain the greatest amount of information. Another 
straightforward approach is forward selection, in which one starts with one attribute and 
selectively adds new attributes to gain the most information. Or, one can perform 
the equivalent process but starting with all of the attributes and removing them, 
known as backward elimination. 
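
As a sketch of PCA in its simplest setting, the example below finds the principal components of a two-attribute dataset from the closed-form eigendecomposition of its 2x2 covariance matrix. The restriction to two attributes, the function names, and the data are assumptions made to keep the example self-contained; real datasets would use a general eigensolver.

```python
import math

def covariance_2d(points):
    """Sample covariance terms (sxx, sxy, syy) of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy

def pca_2d(points):
    """Return [(variance, unit eigenvector), ...] sorted by decreasing variance."""
    sxx, sxy, syy = covariance_2d(points)
    mid = (sxx + syy) / 2.0
    half_gap = math.hypot((sxx - syy) / 2.0, sxy)
    eigenvalues = [mid + half_gap, mid - half_gap]
    if sxy == 0.0:
        # Attributes are uncorrelated: the axes themselves are the components.
        vectors = [(1.0, 0.0), (0.0, 1.0)] if sxx >= syy else [(0.0, 1.0), (1.0, 0.0)]
    else:
        vectors = []
        for lam in eigenvalues:
            norm = math.hypot(sxy, lam - sxx)
            vectors.append((sxy / norm, (lam - sxx) / norm))
    return list(zip(eigenvalues, vectors))

# Two hypothetical, perfectly correlated attributes (second = 2 x first).
points = [(-2.0, -4.0), (-1.0, -2.0), (0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
(var1, vec1), (var2, vec2) = pca_2d(points)
explained = var1 / (var1 + var2)
```

For these data the first component carries all of the variance, so the second attribute is redundant and one dimension suffices.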

In many ways, dimension reduction is similar to classification, in the sense that 
a larger number of input attributes is reduced to a smaller number of outputs. 
Many classification schemes in fact directly use PCA. Other dimension reduction 
methods utilize the same or similar algorithms to those used for the actual data 
mining: an ANN can perform PCA when set up as an autoencoder, and kernel 
methods can act as generalizations of PCA. A binary genetic algorithm (§2.4.4) 
can be used in which each individual represents a subset of the training attributes 
to be used, and the algorithm selects the best subset. The epsilon-approximate 
nearest neighbor search reduces the dimensionality of nearest neighbor methods. 
Other methods include the information bottleneck, which directly uses information 
theory to optimize the tradeoff between the number of classes and the 
information contained, the Fisher Matrix, Independent Component Analysis, and wavelet 
transforms. The curse of dimensionality is likely to worsen in the future for a similar 
reason to that of missing values, as more multiwavelength datasets become 
available to be cross-matched. Classification and dimension reduction are not identical, 
of course: a classification algorithm may build a model to represent the data, which 
is then applied to further examples to predict their classes. 



2.4. Selection and Use of Machine Learning Algorithms 

Machine learning algorithms broadly divide into supervised and unsupervised 
methods, also known as predictive and descriptive, respectively. These can be 
generalized to form semi-supervised methods. Supervised methods rely on a training set 
of objects for which the target property, for example a classification, is known with 
confidence. The method is trained on this set of objects, and the resulting mapping 
is applied to further objects for which the target property is not available. These 
additional objects constitute the testing set. Typically in astronomy, the target 
property is spectroscopic, and the input attributes are photometric; thus one can 
predict properties that would normally require a spectrum for the generally much 
larger sample of photometric objects. The training set must be representative, i.e., 
the parameter space covered by the input attributes must span that for which the 
algorithm is to be used. This might initially seem rather restrictive, but in many 
cases can be handled by combining datasets. For example, the zCOSMOS redshift 
survey, at one square degree, provides spectra to the depth of the photometric 
portion of the Sloan Digital Sky Survey (SDSS), r ~ 22 mag, which covers over 
8000 square degrees. Since SDSS photometry is available for zCOSMOS objects, 
one can in principle use the 40,000 zCOSMOS galaxies as a training set to assign 
photometric redshifts to over 200 million SDSS galaxies. 

In contrast to supervised methods, unsupervised methods do not require a 
training set. This is an advantage in the sense that the data can speak for themselves 
without preconceptions such as expected classes being imposed. On the other hand, 
if there is prior information, it is not necessarily incorporated. Unsupervised 
algorithms usually require some kind of initial input to one or more of the adjustable 
parameters, and the solution obtained can depend on this input. 

(For many astronomical applications, one might more properly call the training set a 
training sample, but the term training set is in widespread use, so we use that here to 
avoid confusion.) 

Semi-supervised methods attempt to allow the best-of-both-worlds, and both 
incorporate known priors while allowing objective data interpretation and extrap- 
olation. But given their generality, they can be more complex and difficult to im- 
plement. They are of potentially great interest astronomically because they could 
be used to analyze a full photometric survey beyond the spectroscopic limit, with- 
out requiring priors, while at the same time incorporating the prior spectroscopic 
information where it is available. 

2.4.1. Supervised Methods 

The most widely used and well-known machine learning algorithm in astronomy 
to date, referred to as far back as the mid 1980s, is the artificial neural 
network (ANN, Fig. 1). This consists of a series of interconnected nodes with 
weighted connections. Each node has an activation function, perhaps a simple 
threshold, or a sigmoid. Although the original motivation was that the nodes would 
simulate neurons in the brain, the ANNs in data mining are of such a size that 
they are best described as nonlinear extensions of conventional statistical methods. 

The supervised ANN takes parameters as input and maps them on to one or 
more outputs. A set of parameter vectors, each vector representing an object and 
corresponding to a desired output, or target, is presented. Once the network is 
trained, it can be used to assign an output to an unseen parameter vector. The 
training uses an algorithm to minimize a cost function. The cost function, $c$, is 
commonly of the form of the mean-squared deviation between the actual and desired 
output: 

$$ c = \frac{1}{N} \sum_{k=1}^{N} (o_k - t_k)^2 , $$

where $o_k$ and $t_k$ are the output and target respectively for the $k$th of $N$ objects. 

In general, the neurons could be connected in any topology, but a commonly 
used form is to have an $a : b_1 : b_2 : \ldots : b_n : c$ arrangement, where $a$ is the 
number of input parameters, $b_1, \ldots, b_n$ are the number of neurons in each of $n$ 
one-dimensional 'hidden' layers, and $c$ is the number of neurons in the final layer, which 
is equal to the number of outputs. Each neuron is connected to every neuron in 
adjacent layers, but not to any others. Multiple outputs can each give the Bayesian 
a posteriori probability that the output is of that specific class, given the values of 
the input parameters. 
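
A minimal sketch of the forward pass through such a layered network, together with the mean-squared cost described above, is given below for a 2:2:1 arrangement. The weights are hand-picked for illustration rather than trained, and the choice of sigmoid hidden layers with a linear output neuron is an assumption of this sketch.

```python
import math

def sigmoid(x):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, layers):
    """Propagate one parameter vector through fully connected layers.

    layers is a list of (weights, biases); weights[j][i] connects input i of the
    layer to its neuron j. Hidden layers use a sigmoid; the final layer is linear.
    """
    activations = inputs
    for depth, (weights, biases) in enumerate(layers):
        sums = [sum(w * a for w, a in zip(row, activations)) + b
                for row, b in zip(weights, biases)]
        is_last = depth == len(layers) - 1
        activations = sums if is_last else [sigmoid(s) for s in sums]
    return activations

def mse_cost(outputs, targets):
    """Mean-squared deviation between the network outputs and the targets."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)

# A 2:2:1 network with arbitrary, untrained weights.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),   # hidden layer: 2 neurons
    ([[2.0, -2.0]], [0.1]),                    # output layer: 1 neuron
]
objects = [[0.0, 0.0], [1.0, 1.0]]             # two parameter vectors
targets = [0.1, 0.0]
outputs = [forward(x, layers)[0] for x in objects]
cost = mse_cost(outputs, targets)
```

A training algorithm would repeatedly adjust the entries of `layers` so as to reduce `cost`.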

The weights are adjusted by the training algorithm. In astronomy this has 
typically been either the well-known backpropagation algorithm or the 
quasi-Newton algorithm, although other algorithms, such as Levenberg-Marquardt, 
have also been used. 

Fig. 1. Schematic of an artificial neural network for an object with n attributes, a hidden layer 
of size p, and a single continuously-valued output, in this case, the redshift, z. From Firth, Lahav 
& Somerville. 



Another common method used in data mining is the decision tree (DT, Fig. 2). 
Decision trees consist first of a root node which contains all of the 
parameters describing the objects in the training set population along with their 
classifications. A node is split into child nodes using the criterion that minimizes 
the classification error. This splitting subdivides the parent population group into 
children population groups, which are assigned to the respective child nodes. The 
classification error quantifies the accuracy of the classification on the test set. The 
process is repeated iteratively, resulting in layered nodes that form a tree. The 
iteration stops when specific user-determined criteria are reached. Possibilities include 
a minimum allowed population of objects in a node (the minimum decomposition 
population), the maximum number of nodes between the termination node and the 
root node (the maximum tree depth), or a required minimum decrease in error resulting 
from a population split (the minimum error reduction). The terminal nodes are 
known as the leaf nodes. The split is tested for each input attribute, and can be 
axis-parallel, or oblique, which allows for hyperplanes at arbitrary angles in the 
parameter space. The split statistic can be the midpoint, mean, or median of the 
attribute values, while the cost function used is typically the variance, as with ANN. 
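
The search for a single axis-parallel split can be sketched as follows, using the midpoint between adjacent attribute values as the split statistic and the population-weighted variance of the child nodes as the cost. The two-attribute dataset and the function names are illustrative assumptions; a full tree would apply the same search recursively to each child node until a stopping criterion is met.

```python
def variance(values):
    """Population variance of a list of targets (0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(rows, targets):
    """Return (cost, attribute_index, threshold) of the best axis-parallel split.

    Every midpoint between adjacent distinct values of every attribute is tested;
    the cost is the population-weighted variance of the two child nodes.
    """
    best = None
    n = len(rows)
    for attr in range(len(rows[0])):
        values = sorted(set(row[attr] for row in rows))
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [t for row, t in zip(rows, targets) if row[attr] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[attr] > threshold]
            cost = (len(left) * variance(left) + len(right) * variance(right)) / n
            if best is None or cost < best[0]:
                best = (cost, attr, threshold)
    return best

# Hypothetical objects with two attributes; targets are class labels 0 and 1.
rows = [[0.2, 1.0], [0.4, 3.0], [0.6, 1.2], [0.8, 2.9]]
targets = [0.0, 1.0, 0.0, 1.0]
cost, attr, threshold = best_split(rows, targets)
```

Here the second attribute separates the two classes perfectly, so the split is placed on it and both child nodes are pure.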

Fig. 2. As Fig. 1, but showing a decision tree. The oblique planes specified by the division criteria 
on the input attributes $x_1$ and $x_2$ at the nodes in this case divide the input parameter space into 
three regions. From Salzberg et al. 

In recent years, another algorithm, the support vector machine (SVM, Fig. 3), 
has gained popularity in astronomical data mining. 
SVM aims to find the hyperplane that best separates two classes of data. The 
input data are viewed as sets of vectors, and the data points closest to the 
classification boundary are the support vectors. The algorithm does not create a model 
of the data, but instead creates the decision boundaries, which are defined in terms 
of the support vectors. The input attributes are mapped into a higher dimensional 
space using a kernel so that nonlinear relationships within the data become linear 
(the 'kernel trick'), and the decision boundaries, which are linear, are determined 
in this space. Like ANN and DT, the training algorithm minimizes a cost function, 
which in this case is the number of incorrect classifications. The algorithm has two 
adjustable hyperparameters: the width of the kernel, and the regularization, or cost, 
of classification error, which helps to prevent overfitting (§2.5) of the training set. 
The shape of the kernel is also an adjustable parameter, a common choice being the 
Gaussian radial basis function. As a result, an SVM has fewer adjustable parameters 
than an ANN or DT, but because these parameters must be optimized, the training 
process can still be computationally expensive. SVM is designed to classify objects 
into two classes. Various refinements exist to support additional classes, and to 
perform regression, i.e., to supply a continuous output value instead of a classification. 
Classification probabilities can be output, for example, by using the distance of a 
data point from the decision boundary. 
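
To illustrate how a decision boundary is expressed in terms of support vectors and a kernel, the sketch below evaluates a Gaussian radial basis function kernel and the resulting decision value for a new point. It does not perform SVM training: the support vectors, coefficients, and bias are set by hand purely for illustration, whereas in practice they would come from the optimization described above.

```python
import math

def rbf_kernel(x, y, width=1.0):
    """Gaussian radial basis function kernel between two attribute vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def decision_value(x, support_vectors, coefficients, bias, width=1.0):
    """Kernel expansion f(x) = sum_i coeff_i * K(sv_i, x) + bias.

    The sign of f(x) gives the predicted class; its magnitude grows with
    distance from the decision boundary in the kernel-induced space.
    """
    return sum(c * rbf_kernel(sv, x, width)
               for sv, c in zip(support_vectors, coefficients)) + bias

# Hand-set (not trained) support vectors: one per class.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
coefficients = [1.0, -1.0]   # sign encodes the class of each support vector
bias = 0.0

f_near_first = decision_value((0.1, 0.1), support_vectors, coefficients, bias)
f_near_second = decision_value((1.9, 2.0), support_vectors, coefficients, bias)
```

The kernel width used here plays the role of the first hyperparameter mentioned in the text; narrowing it makes the boundary follow the support vectors more tightly.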

Fig. 3. As Fig. 1, but showing a support vector machine. The circled points are the support 
vectors between the two classes of objects, represented by open and filled circles. The cases shown 
are separable and non-separable data with linear and nonlinear boundaries; $w$ is the normal to the 
hyperplane, and $b$ is the perpendicular distance. From Huertas-Company et al., to which the 
reader is referred for details. 

Another powerful but computationally intensive method is k nearest neighbor 
(kNN). This method is powerful because it can utilize the full 
information available for each object, with no approximations or interpolations. The 
training of kNN is in fact trivial: the positions of each of the objects in the input 
attribute space are simply stored in memory. For each test object, the same attributes 
are compared to the training set and the output is determined using the properties 
of the nearest neighbors. The simplest implementation is to output the properties 
of the single nearest neighbor, but more commonly the weighted sum of k nearest 
neighbors is used. The weighting is typically the inverse Euclidean distance in the 
attribute space, but one can also use adaptive distance metrics. The main drawback 
of this method is that it is computationally intensive, because for each testing object 
the entire training set must be examined to determine the nearest neighbors. This 
requires a large number of distance calculations, since the test datasets are often 
much larger than the training datasets. The workload can be mitigated by storing 
the training set in an optimized data structure, such as a kd-tree. 
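
A brute-force version of kNN with inverse-distance weighting can be sketched as below. The two-attribute training set, its redshift-like targets, and the function name are assumptions for illustration; the full scan over the training set for every query is exactly the cost that a kd-tree is meant to reduce.

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Inverse-distance-weighted mean of the targets of the k nearest neighbors."""
    distances = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(x, query))), y)
        for x, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    for dist, target in nearest:
        if dist == 0.0:
            return target            # exact match: avoid dividing by zero
    weights = [1.0 / dist for dist, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Hypothetical training set: two input attributes and one continuous target.
train_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_y = [0.1, 0.2, 0.3, 2.0]

estimate = knn_predict(train_x, train_y, [0.5, 0.0], k=2)
exact = knn_predict(train_x, train_y, [5.0, 5.0], k=2)
```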

However, in the past few years, novel supercomputing hardware (which is 
discussed in more detail in §4.7) has become available that is specifically designed to 
carry out exactly this kind of computationally intensive work, including 
applications involving a large number of distance calculations. The curve of growth of this 
technology exceeds that of conventional CPUs, and thus the direct 
implementation of kNN using this technology has the potential to exceed the performance of 
conventional CPUs. 



2.4.2. Unsupervised Methods 

Kernel density estimation (KDE) is a method of estimating the 
probability density function of a variable. It is a generalization of a histogram where 
the kernel function is any shape instead of the top-hat function of a histogram bin. 
This has the advantages that it avoids the discrete nature of the histogram and does 
not depend on the position of the bin edges, but the width of the kernel must still be 
chosen so as not to over- or under-smooth the data. A Gaussian kernel is commonly 
utilized. In higher numbers of dimensions, common in astronomical datasets, the 
width of the kernel must be specified in each dimension. 
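
A one-dimensional Gaussian KDE can be sketched in a few lines. The sample values and the kernel width of 0.2 are arbitrary choices for illustration; selecting the width is the part that requires care in practice.

```python
import math

def gaussian_kde(samples, x, width):
    """Density estimate at x: the mean of Gaussian kernels centered on each sample."""
    norm = 1.0 / (len(samples) * width * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / width) ** 2) for s in samples)

samples = [0.9, 1.0, 1.1, 3.0]   # hypothetical measurements of one variable
width = 0.2

density_at_cluster = gaussian_kde(samples, 1.0, width)
density_in_gap = gaussian_kde(samples, 2.0, width)

# Numerical check that the estimate integrates to ~1 over a wide grid.
step = 0.01
integral = sum(gaussian_kde(samples, -2.0 + i * step, width) * step
               for i in range(801))
```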

K-means clustering is an unsupervised method that divides data into 
clusters. The number of clusters must be initially specified, but since the algorithm 
converges rapidly, many starting points can be tested. The algorithm uses a distance 
criterion for cluster membership, such as the Euclidean distance, and a stopping 
criterion for iteration, for example, when the cluster membership ceases to change. 
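
The iteration described above can be sketched as follows, using the Euclidean distance for membership and stopping when the membership ceases to change. The points and the starting centers are invented for illustration, and the rule of leaving an empty cluster's center unchanged is an assumption of this sketch.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centers, max_iter=100):
    """Return (centers, assignment) after iterating until membership is stable."""
    centers = [list(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break                      # stopping criterion: membership unchanged
        assignment = new_assignment
        # Move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                # an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, assignment = kmeans(points, initial_centers=[[0.0, 0.0], [1.0, 0.0]])
```

Even with both starting centers placed inside the same group, the algorithm recovers the two groups here within three iterations; less well separated data can converge to different solutions from different starting points.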

Mixture models decompose a distribution into a sum of components, each 
of which is a probability density function. Often, the distributions are Gaussians, 
resulting in Gaussian mixture models. They are often used for clustering, but also for 
density estimation, and they can be optimized using either expectation 
maximization or Monte Carlo methods. Many astronomical datasets consist of contributions 
from different populations of objects, which allows mixture modeling to disentangle 
these population groups. Mixture models based on the EM algorithm have been 
used in astronomy for this purpose. 

Expectation maximization (EM) [69, 70, 71] treats the data as a sum of probability
distributions, which each represent one cluster. This method alternates between an
expectation stage and a maximization stage. In the expectation stage, the algorithm
evaluates the membership probability of each data point given the current distribution
parameters. In the maximization stage, these probabilities are used to update
the parameters. This method works well with missing data, and can be used as the
unsupervised component in semi-supervised learning (§2.4.3) to provide class labels
for the supervised learning.
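
The alternation between the two stages can be sketched for the simplest case of a two-component, one-dimensional Gaussian mixture. This is an illustrative example only; the function names, the starting values, the floor on the component width, and the data are assumptions rather than anything specified in the text.

```python
import math

def normal_pdf(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, means, sigmas, weights, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture, from given starting values."""
    means, sigmas, weights = list(means), list(sigmas), list(weights)
    for _ in range(n_iter):
        # Expectation: membership probability of each point in each component.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
            total = sum(p)
            resp.append([pi / total for pi in p])
        # Maximization: update the parameters using those probabilities as weights.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsing component
            weights[j] = nj / len(data)
    return means, sigmas, weights, resp

# Hypothetical data drawn from two populations, near 0 and near 5.
data = [-0.3, -0.1, 0.0, 0.2, 0.4, 4.7, 4.9, 5.0, 5.2, 5.3]
means, sigmas, weights, resp = em_two_gaussians(
    data, means=[1.0, 4.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5]
)
```

The returned `resp` holds the membership probabilities, which is the quantity that could serve as soft class labels in a semi-supervised setting.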

The Kohonen self-organizing map (SOM) is an unsupervised neural network
that forms a general framework for visualizing datasets of more than two
dimensions. Unlike many methods which seek to map objects onto a new output
space, the SOM is fundamentally topological. This is neatly illustrated by the fact
that one astronomical SOM application is titled 'Galaxy Morphology Without
Classification'. A related earlier method is learning vector quantization.

Independent component analysis (ICA), an example of blind source
separation, can separate nonlinear components of a dataset, under the assumption
that those components are statistically independent. The components are found by
maximizing this independence. Related statistical methods include principal component
analysis (§2.3), singular value decomposition, and non-negative matrix
factorization.



2.4.3. Semi-Supervised

The semi-supervised approach has been somewhat underused to date, but
holds great potential for the upcoming, large, purely photometric surveys. Supervised
methods require a labeled training set, but will not assign new classes. On the
other hand, unsupervised methods do not require training, but do not use existing
known information. Semi-supervised methods aim to capture the best from both of
these methods by retaining the ability to discover new classes within the data, and
also incorporating information from a training set when available. An example of a
dataset amenable to the approach is shown in Fig. 4.

This is particularly relevant in astronomical applications using large amounts of 
photometric and a more limited subsample of spectroscopic data, which may not 
be fully representative of the photometric sample. The semi-supervised approach 
allows one to use the spectral information to extrapolate into the purely photometric 
regime, thereby allowing a scientist to utilize all of the vast amount of information 
present there. 

Semi-supervised learning represents an entire subfield of data mining research. 
Given the nontrivial implementation requirements, this field is a good area for po- 
tential fruitful collaborations between astronomers, computer scientists, and statis- 
ticians. As one example of a possible issue, a lot of photometric data are likely to 
be a direct continuation in parameter space of spectroscopic data, with, therefore, a 
highly overlapping distribution. This means that certain semi-supervised approaches 
will work better than others, because they contain various assumptions about the 
nature of the labeled and unlabeled data.

Fig. 4. Dataset amenable to semi-supervised learning, showing labeled and unlabeled classes,
denoted by 1-4 and U1-U4, respectively. The axes are arbitrary units. The crosses result from a
mixture model applied to the data. From Bazell & Miller.

2.4.4. Other Algorithms

In §§2.4.1-2.4.2 above, we described the main data mining algorithms used to date
in astronomy; however, there are numerous additional algorithms available, which
have often been utilized to some extent. These algorithms may be employed at
more than one stage in the process, such as attribute selection, as well as the
classification/regression stage.

While neural networks in some very broad sense mimic the learning mechanism
of the brain, genetic algorithms mimic natural selection, as the most
successful individuals created are those that are best adapted for the task at hand.

The simplest implementation is the binary genetic algorithm, in which each 
'individual' is a vector of ones and zeros, which represent whether or not a particular 
attribute, e.g., a training set attribute, is used. From an initial random population, 
the individuals are combined to create new individuals. The fitness of each individual 
is the resulting error in the training algorithm run according to the formula encoded 
by the individual. This process is repeated until convergence is found, producing the
best individual.

A typical method of combining two individuals is one-point crossover, in which 
segments of two individuals are swapped. To more fully explore the parameter space, 
and to prevent the algorithm from converging too rapidly on a local minimum, a 
probability of mutation is introduced into the newly created individuals before they 
are processed. This is simply the probability that a zero becomes a one, or vice
versa. An approximate number of individuals to use is given by $n_{in} \sim 2 n_f \log(n_f)$,
where $n_f$ is the number of attributes. The algorithm converges in $n_{it} \sim \alpha n_f \log(n_f)$
iterations, where $\alpha$ is a problem-dependent constant; generally $\alpha > 3$.
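
A binary genetic algorithm of the kind just described can be sketched as follows, with one-point crossover, bit-flip mutation, and a population size and iteration count that scale as (number of attributes) x log(number of attributes). The use of natural logarithms, the truncation selection of the fitter half, the mutation probability, and the stand-in error function `toy_error` are all assumptions made for illustration; in a real application the error would come from running the training algorithm on the selected attributes.

```python
import math
import random

def binary_ga(n_attributes, error_fn, alpha=3.0, p_mutation=0.02, seed=0):
    """Binary genetic algorithm minimizing error_fn over attribute masks."""
    rng = random.Random(seed)
    n_pop = max(4, round(2 * n_attributes * math.log(n_attributes)))
    n_iter = round(alpha * n_attributes * math.log(n_attributes))
    pop = [[rng.randint(0, 1) for _ in range(n_attributes)] for _ in range(n_pop)]
    best = min(pop, key=error_fn)
    for _ in range(n_iter):
        # Selection: keep the fitter half of the population as parents.
        parents = sorted(pop, key=error_fn)[: n_pop // 2]
        children = []
        while len(children) < n_pop:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n_attributes - 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < p_mutation else bit for bit in child]
            children.append(child)
        pop = children
        best = min(pop + [best], key=error_fn)  # remember the best individual seen
    return best

# Hypothetical stand-in for the training error: the number of attributes whose
# inclusion flag differs from a notional "useful" set.
useful = [1, 0, 1, 1, 0, 0, 1, 0]

def toy_error(mask):
    return sum(m != u for m, u in zip(mask, useful))

best_mask = binary_ga(len(useful), toy_error)
```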

Numerous refinements to this basic approach exist, including using continuous
values instead of binary ones, and more complex methods for combining individuals.
Further possibilities for the design of genetic algorithms exist, and it is possible
in principle to combine the optimization of the learning algorithm and the attribute
set.

The information bottleneck method is based directly on information theory
and is designed to achieve the best tradeoff between accuracy and compression for
the desired number of classes. The inputs and outputs are probability density functions.
Association rule mining is a method of finding qualitative rules within a
database such that a rule derived from the occurrence of certain variables together
implies something about the occurrence of a variable not used in creating that rule.
The false discovery rate is a method of establishing a significant discovery from
a smaller set of data than the usual n-sigma hypothesis test.

This list could continue, broadening into traditional statistical methods such as
least squares and regression, as well as Bayesian methods, which are widely used
in astronomy. For brevity we do not consider these additional methods, but we
do note that graphical models are a general way of describing the interrelationships
between variables and probabilities, and many of the data mining algorithms
described earlier, such as ANNs, are special cases of these models.

2.4.5. Choice of Algorithm

Unfortunately, there is no simple method to select the optimal algorithm to use,
because the most appropriate algorithm can depend not only on the dataset, but
also the application for which it will be employed. There is, therefore, no single best
algorithm. Likewise, the choice of software is similarly non-trivial. Many general
frameworks exist, for example WEKA or Data to Knowledge, but it is unlikely
that one framework will be able to perform all steps necessary from raw catalog
to desired science result, particularly for large datasets. In Table 1 we summarize
some of the advantages and disadvantages of some of the more popular and well-known
algorithms used in astronomy. We do not attempt to summarize available
software. Various other general comparisons of machine learning algorithms exist,
as well as numerous studies comparing various algorithms for particular datasets, a
field which itself is rather complex.





Table 1. Advantages and disadvantages of well-known machine learning algorithms in astronomy. These algorithms, and others, are described in more
detail in §§2.4.1-2.4.4.

| Algorithm | Advantages | Disadvantages |
|---|---|---|
| Artificial Neural Network | Good approximation of nonlinear functions; Easily parallelized; Good predictive power; Extensively used in astronomy; Robust to irrelevant or redundant attributes | Black-box model; Local minima; Many adjustable parameters; Affected by noise; Can overfit; Long training time; No missing values |
| Decision Tree | Popular real-world data mining algorithm; Can input and output numerical or categorical variables; Interpretable model; Robust to outliers, noisy or redundant attributes; Good computational scalability | Can generate large trees that require pruning; Generally poorer predictive power than ANN, SVM or kNN; Can overfit; Many adjustable parameters |
| Support Vector Machine | Copes with noise; Gives expected error rate; Good predictive power; Popular algorithm in astronomy; Can approximate nonlinear functions; Good scalability with number of attributes; Unique solution (no local minima) | Harder to classify > 2 classes; No model is created; Long training time; Poor interpretability; Poor at handling irrelevant attributes; Can overfit; Some adjustable parameters |
| Nearest Neighbor | Uses all available information; Does not require training; Easily parallelized; Few or no adjustable parameters; Good predictive power | Computationally intensive; No model is created; Can be affected by noise and irrelevant attributes |
| Expectation Maximization | Gives number of clusters in the data; Fast convergence; Copes with missing data; Can give class labels for semi-supervised learning | Can be biased toward Gaussians; Local minima |

2.5. Improving Results

Many of the algorithms previously described involve 'greedy' optimization. In these 
cases, the cost function, which is the measure of how well the algorithm is performing 
in its classification or prediction task, is minimized in a way that does not allow 
the value of the function to increase much if at all. As a result, it is possible for the 
optimization to become trapped in a local minimum, whereby nearby configurations 
are worse, but better configurations exist in a different region of parameter space. 
Various approaches exist to overcome local minima. One approach is to simply 
run the algorithm several times from different starting points. Another approach is 
simulated annealing, where, in following the metallurgical metaphor, the
point in parameter space 'heats up', thus perturbing it and allowing it to escape 
from the local minimum. The point is allowed to 'cool', thus having the ability to 
find a solution closer to the global minimum. 
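
A minimal sketch of simulated annealing on a one-dimensional cost function with a local and a global minimum is given below. The acceptance rule used (accept a worse point with probability exp(-increase/temperature)) is the common Metropolis choice and is an assumption here, as are the function names, the geometric cooling schedule, and the hypothetical `double_well` cost.

```python
import math
import random

def simulated_annealing(cost, x0, step=0.5, t_start=2.0, t_end=1e-3,
                        cooling=0.999, seed=0):
    """Minimize a 1-D cost function; returns the best point visited."""
    rng = random.Random(seed)
    x, fx = x0, cost(x0)
    best_x, best_f = x, fx
    t = t_start
    while t > t_end:
        x_new = x + rng.uniform(-step, step)
        f_new = cost(x_new)
        # Always accept improvements; accept a worse point with probability
        # exp(-increase / temperature), which shrinks as the system cools.
        if f_new <= fx or rng.random() < math.exp(-(f_new - fx) / t):
            x, fx = x_new, f_new
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling  # geometric cooling schedule
    return best_x, best_f

# Hypothetical cost with a local minimum near x = +1 and the global one near x = -1.
def double_well(x):
    return (x * x - 1.0) ** 2 + 0.3 * x

# Start in the basin of the local minimum.
x_best, f_best = simulated_annealing(double_well, x0=1.0)
```

A purely greedy minimizer started at x = 1 would remain in the local minimum; the early high-temperature phase is what allows the search to cross the barrier near x = 0.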

Models produced by data mining algorithms are subject to a fundamental lim- 
itation common to many systems in which a predictive model is constructed, the 
bias-variance tradeoff. The bias is the accuracy of the model in describing the data, 
for example, a linear model might have a higher bias than a higher order polyno- 
mial. The variance is the accuracy of this model in describing new data. The higher 
order polynomial might have a lower bias than a linear model, but it might be 
more strongly affected by variations in the data and thus have a higher variance. 
The polynomial has overfit the data. There is usually an optimal point between 
minimizing bias and minimizing variance. A typical way to minimize overfitting
is to measure the performance of the algorithm on a test set, which is not part
of the training set, and to adjust the stopping criterion for training to stop at an
appropriate location.

To help prevent overfitting, training can also be regularized, in which an extra 
term is introduced into the cost function to penalize configurations that add com- 
plexity, such as large weights in an ANN. This complexity can cause a function to 
be less smooth, which increases the likelihood of overfitting. As is the case with 
supervised learning, unsupervised algorithms can also overfit the data, for example, 
if some kind of smoothing is employed but its scalelength is too small. In this case, 
the algorithm will 'fit the noise' and not the true underlying distribution. 
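
The effect of a regularization term can be sketched with a linear model trained by gradient descent, where the cost is the mean squared error plus a penalty proportional to the sum of squared weights. The function name `fit_linear`, the penalty strength, the learning rate, and the data are illustrative assumptions; the same idea applies to the weights of an ANN.

```python
def fit_linear(xs, ys, penalty=0.0, lr=0.01, n_steps=2000):
    """Gradient descent on mean squared error + penalty * sum(w**2)."""
    n_features = len(xs[0])
    w = [0.0] * n_features
    for _ in range(n_steps):
        grad = [2.0 * penalty * wj for wj in w]  # gradient of the penalty term
        for x, y in zip(xs, ys):
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            for j in range(n_features):
                grad[j] += 2.0 * err * x[j] / len(xs)  # gradient of the data term
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical data with two attributes per object.
xs = [(1.0, 0.5), (2.0, -1.0), (3.0, 0.3), (4.0, -0.2)]
ys = [1.1, 1.9, 3.2, 3.9]

w_plain = fit_linear(xs, ys, penalty=0.0)
w_ridge = fit_linear(xs, ys, penalty=1.0)
norm_plain = sum(wj * wj for wj in w_plain)
norm_ridge = sum(wj * wj for wj in w_ridge)
```

The penalized fit ends with a smaller weight norm than the unpenalized one, at the price of a somewhat larger error on the training data.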

Another common approach to control overfitting and improve confidence in the 
accuracy of the results is cross-validation, where subsets of the data are left out of 
the training and used for testing. The simplest form is the holdout method, where 
a single subset of the training data is kept out of the training, and the algorithm 
error is evaluated by running on this subset. However, this can have a high bias 
(see bias-variance tradeoff, above) if the training set is small, due to a significant 
portion of the training information being left out. K-fold cross-validation improves
on this by subdividing the data into K samples and training on K - 1 samples
while testing on the remaining one, or alternatively using K random subsets.
Typically, K = 5 or K = 10, as small K
could still have high bias, as in the holdout method, but large K, while being less
biased, can have high variance due to the testing set being small. If K is increased
to the size of the dataset, so that each subsample is a single point, the method
becomes leave-one-out cross-validation. In all instances, the estimated error is the
mean error from those produced by each run in the cross-validation.
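
As an illustration, K-fold cross-validation can be written in a few lines. The helper names (`k_fold_indices`, `cross_validate`), the trivial mean-predicting model, and the data are hypothetical; any training and error function with the same call signature could be substituted.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, k, train, error):
    """Mean test error over k folds; `train` returns a model, `error` scores it."""
    folds = k_fold_indices(len(xs), k)
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = train([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors.append(
            error(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])
        )
    return sum(errors) / len(errors)

# Hypothetical model: predict the mean of the training targets.
def train_mean(xs, ys):
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)

xs = list(range(20))
ys = [3.0] * 20
cv_error = cross_validate(xs, ys, k=5, train=train_mean, error=mse)

# Leave-one-out is the special case in which each fold holds a single point.
loo_folds = k_fold_indices(len(xs), k=len(xs))
```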

Another important refinement to running one algorithm is the ability to use 
a committee of instances of the algorithm, each with different parameters. One 
can allow these different instantiations to vote on the final prediction, so that the 
majority or averaged result becomes the final answer. Such an arrangement can 
often provide a substantial improvement, because it is more likely that the majority 
will be closer to the correct answer, and that the answer will be less affected by 
outliers. One such committee arrangement is bootstrap aggregating, or bagging,
where random subsamples with replacement (bootstrap samples) are taken, and the
algorithm trained on each. The created algorithms vote on the testing set. Bagging
is often applied to decision trees with considerable success, but it can be applied to
other algorithms. The combination of bagging and the random selection of a small
subset of features for splitting at each node is known as a Random Forest.
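
Bagging can be sketched with the simplest possible decision tree, a single-threshold 'stump', trained on bootstrap samples and combined by majority vote. The function names, the number of committee members, and the one-dimensional data are assumptions for illustration only.

```python
import random

def train_stump(xs, ys):
    """Decision stump: best threshold on a 1-D attribute for labels 0/1."""
    best = None
    for t in sorted(set(xs)):
        for sign in (0, 1):  # class predicted on the low side of the threshold
            preds = [sign if x <= t else 1 - sign for x in xs]
            errors = sum(p != y for p, y in zip(preds, ys))
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    _, t, sign = best
    return lambda x: sign if x <= t else 1 - sign

def bagging(xs, ys, n_models=25, seed=0):
    """Train one stump per bootstrap sample; predict by majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        models.append(train_stump([xs[i] for i in sample], [ys[i] for i in sample]))

    def predict(x):
        votes = sum(m(x) for m in models)
        return 1 if 2 * votes > len(models) else 0

    return predict

# Hypothetical data: class 0 at small values, class 1 at large values.
xs = [0.1, 0.4, 0.8, 1.1, 1.5, 2.6, 2.9, 3.3, 3.7, 4.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predict = bagging(xs, ys)
```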

Boosting uses the fact that several 'weak' instances of an algorithm can be 
combined to produce a stronger instance. The weak learners are iteratively added 
and misclassified objects in the data gain higher weight. Thus boosting is not the 
same as bagging because the data themselves are weighted. Boosted decision trees 
are a popular approach, and many different boosting algorithms are available. 
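
One widely used boosting scheme is AdaBoost, sketched below with threshold stumps as the weak learners. The choice of AdaBoost, the function names, the number of rounds, and the data are assumptions for illustration; the text above does not single out a particular boosting algorithm.

```python
import math

def best_stump(xs, ys, weights):
    """Weighted-error-minimizing threshold classifier for labels -1/+1."""
    best = None
    for t in sorted(set(xs)):
        for sign in (-1, 1):  # class predicted on the low side of the threshold
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (sign if x <= t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, n_rounds=10):
    """AdaBoost: reweight misclassified points and add weak learners in turn."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, sign)
    for _ in range(n_rounds):
        err, t, sign = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, sign))
        # Misclassified points gain weight; correctly classified ones lose it.
        weights = [w * math.exp(-alpha * y * (sign if x <= t else -sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]

    def predict(x):
        score = sum(a * (s if x <= t else -s) for a, t, s in ensemble)
        return 1 if score >= 0 else -1

    return predict

# Hypothetical 1-D data that no single threshold can separate: +1 in the middle.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-1, -1, 1, 1, -1, -1]
single_stump_error = best_stump(xs, ys, [1.0 / 6] * 6)[0]
predict = adaboost(xs, ys, n_rounds=3)
training_errors = sum(predict(x) != y for x, y in zip(xs, ys))
```

Here no single stump can classify the training set correctly, but the weighted combination of three stumps does.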

As well as committees of the same algorithm, it is also possible to combine the 
results of more than one different algorithm on the same dataset. Such a mixture of 
experts approach often provides an optimal result on real data. The outcome may 
be decided by voting, or the output of one algorithm can form the input to another, 
in a chaining approach. 

For many astronomical applications, the results are, or would be, significantly 
improved by utilizing the full probability density function (PDF) for a predicted 
property, rather than simply its single scalar value. This is because much more 
information is retained when using the PDF. Potential uses of PDFs are described
further in §4.1.

2.6. Application of Algorithms and Some Limitations 

The purpose of this review is not to uncritically champion certain data mining 
algorithms, but to instead encourage scientific progress by exploiting the full poten- 
tial of these algorithms in a considered scientific approach. We therefore end this 
section by outlining some of the limitations of and issues raised by KDD and the 
data mining approach to current and future astronomical datasets. Several of these 
problems might be ameliorated by increased collaboration between astronomers and 
data mining experts. 

• Extrapolation: In many astronomical applications, it is common for data with less 
information content to be available for a greater number of objects over a larger
parameter space. The classic example is in surveys where photometric objects are
typically observed several magnitudes fainter than spectroscopic objects. For a 
supervised learning algorithm, it is usually inappropriate to extrapolate beyond 
the parameter space for which the training set (e.g., the spectroscopic objects) is 
representative. 

• Non-intuitive results: It is very easy to run an implementation of a well-known 
algorithm and output a result that appears reasonable, but is in fact either statis- 
tically invalid or completely wrong. For example, randomly subsampled training 
and testing sets from a dataset will overlap and produce a model that overfits the 
data. 

• Measurement error: Most astronomical data measurements have an associated
error, but most data mining algorithms do not take this explicitly into account. 
For many algorithms, the intrinsic spread in the data corresponding to the target 
property is the measurement of the error. 

• Adjustable parameters: Several algorithms have a significant number of adjustable 
parameters, and the optimal configuration of these parameters is not obvious. 
This can result in large parameter sweeps that further increase the computational 
requirement. 

• Scalability: Many data mining algorithms scale, for n objects, as $n^2$, or even worse,
making their simple application to large datasets on normal computing hardware
intractable. One can often speed up a naive implementation of an algorithm that
must access large numbers of data points and their attributes by storing the data
in a hierarchical manner so that not all the data need to be searched. A popular
hierarchical structure for accomplishing this task is the kd-tree. However, the
implementation of such trees for large datasets and on parallel machines remains
a difficult problem.

• Learning Curve: Data mining is an entire field of study in its own right, with 
strong connections to statistics and computing. Avoiding some of the
issues we present, through the selection of appropriate algorithms, collaboration
where needed, and the full exploitation of their potential for science return, requires
overcoming a substantial learning curve.

• Large datasets: Many astronomical datasets are larger than can be held in ma- 
chine memory. The exploitation of these datasets thus requires more sophisticated 
database technology than is currently employed by most astronomical projects. 

• 'It's not science': The success of an astronomical project is judged by the science
results produced. The time invested by an astronomer in becoming an expert in 
data mining techniques must be balanced against the expected science gain. It is 
difficult to justify and obtain funding based purely on a methodological approach 
such as data mining, even if such an approach will demonstrably improve the 
expected science return. 

• It does not do the science for you: The algorithms will output patterns, but will 
not necessarily establish which patterns or relationships are important scientifically,
or even which are causal. The truism 'correlation is not causation' is apt
here. The successful interpretation of data mining results is up to the scientist.

• The result can only be as good as the data: Related to this, the single largest factor
in the success of any data mining algorithm is the quality of the input data. If the
data are not sufficient for the task, or are poorly collected or incorrectly treated,
the result will not be useful.
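
The hierarchical storage mentioned under Scalability can be illustrated with a small kd-tree and a nearest-neighbor query that skips subtrees which cannot contain a closer point. This is a serial sketch for data held in memory; the dictionary-based node layout, the function names, and the random data are assumptions, and none of the difficulties of very large or parallel implementations are addressed.

```python
import math
import random

def build_kdtree(points, depth=0):
    """Recursively split points at the median along alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Nearest neighbor of target; skips subtrees that cannot hold a closer point."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, target, best)
    # Only search the far side if the splitting plane is closer than the best so far.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best

# Hypothetical catalog of 500 objects with two attributes each.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(500)]
tree = build_kdtree(points)
query = (0.3, 0.7)
found = nearest(tree, query)

# Brute-force search over all points, for comparison.
brute = min(points, key=lambda p: math.dist(p, query))
```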



3. Uses in Astronomy 

We now turn to the use of data mining algorithms in astronomical applications, 
and their track record in addressing some common problems. Whereas in §2 we
introduced terms for the astronomer unfamiliar with data mining, here for the non-expert
in astronomy we briefly put in context the astronomical problems. However,
a full description is beyond the scope of this review. Whereas §2 was subdivided according
to data mining algorithms and issues, here the subdivision is in terms of the
astrophysics. Throughout this section, we abbreviate data mining algorithms that
are either frequently mentioned or have longer names according to the abbreviations
introduced in §2: PCA, ANN, DT, SVM, kNN, KDE, EM, SOM, and ICA.

Given that there is no exact definition of what constitutes a data mining tool, 
it would not be possible to provide a complete overview of their application. This 
section therefore illustrates the wide variety of actual uses to date, with actual 
or implied further possibilities. Uses which exist now but will likely gain greater 
significance in the future, such as the time domain, are largely deferred to §4. Several
other overviews of applications of machine learning algorithms in astronomy exist,
and contain further examples, including ones for ANN, DT,
genetic algorithms, and stellar classification.

Most of the applications in this section are made by astronomers utilizing data 
mining algorithms. However, several projects and studies have also been made by 
data mining experts utilizing astronomical data, because, along with other fields 
such as high energy physics and medicine, astronomy has produced many large 
datasets that are amenable to the approach. Examples of such projects include the
Sky Image Cataloging and Analysis System (SKICAT) for catalog production
and analysis of catalogs from digitized sky surveys, in particular the scans of the
second Palomar Observatory Sky Survey; the Jet Propulsion Laboratory Adaptive
Recognition Tool (JARTool), used for recognition of volcanoes in the over 30,000
images of Venus returned by the Magellan mission; the subsequent and more general
Diamond Eye; and the Lawrence Livermore National Laboratory Sapphire
project. A recent review of data mining from this perspective is given by Kamath
in the book Scientific Data Mining. In general, the data miner is likely to employ
more appropriate, modern, and sophisticated algorithms than the domain scientist,
but will require collaboration with the domain scientist to acquire knowledge as to
which aspects of the problem are the most important.

3.1. Object classification



Classification is often an important initial step in the scientific process, as it provides 
a method for organizing information in a way that can be used to make hypotheses 
and to compare with models. Two useful concepts in object classification are the 
completeness and the efficiency, also known as recall and precision. They are defined 
in terms of true and false positives (TP and FP) and true and false negatives (TN 
and FN). The completeness is the fraction of objects that are truly of a given type
that are classified as that type:

$$\mathrm{completeness} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$$

and the efficiency is the fraction of objects classified as a given type that are truly
of that type:

$$\mathrm{efficiency} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}.$$

These two quantities are astrophysically interesting because, while one obviously 
wants both higher completeness and efficiency, there is generally a tradeoff involved. 
The importance of each often depends on the application, for example, an inves- 
tigation of rare objects generally requires high completeness while allowing some 
contamination (lower efficiency), but statistical clustering of cosmological objects 
requires high efficiency, even at the expense of completeness. 
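
As a short worked example, the two quantities can be computed for one class directly from lists of true and predicted labels. The function name and the six-object catalog are hypothetical.

```python
def completeness_and_efficiency(true_labels, predicted_labels, target):
    """Completeness (recall) and efficiency (precision) for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    completeness = tp / (tp + fn) if tp + fn else float("nan")
    efficiency = tp / (tp + fp) if tp + fp else float("nan")
    return completeness, efficiency

# Hypothetical catalog: three true stars and three true galaxies.
truth = ["star", "star", "star", "galaxy", "galaxy", "galaxy"]
predicted = ["star", "star", "galaxy", "star", "star", "galaxy"]

# For the class "star": TP = 2, FN = 1, FP = 2.
c, e = completeness_and_efficiency(truth, predicted, "star")
```

In this example two of the three true stars are recovered (completeness 2/3), while only two of the four objects classified as stars are truly stars (efficiency 1/2).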



3.1.1. Star-Galaxy Separation

Due to their small physical size compared to their distance from us, almost all stars 
are unresolved in photometric datasets, and thus appear as point sources. Galaxies, 
however, despite being further away, generally subtend a larger angle, and thus 
appear as extended sources. However, other astrophysical objects such as quasars 
and supernovae, also appear as point sources. Thus, the separation of photometric 
catalogs into stars and galaxies, or more generally, stars, galaxies, and other objects, 
is an important problem. The sheer number of galaxies and stars in typical surveys 
(of order 10^ or above) requires that such separation be automated. 

This problem is a well studied one and automated approaches were employed
even before current data mining algorithms became popular, for example, during
digitization by the scanning of photographic plates by machines such as the
and DPOSS. Several data mining algorithms have been employed, including
ANN, DT, mixture modeling, and SOM,
with most algorithms achieving over 95% efficiency. Typically, this is done using a
set of measured morphological parameters that are derived from the survey photometry,
with perhaps colors or other information, such as the seeing, as a prior.
The advantage of this data mining approach is that all such information about each
object is easily incorporated. As well as the simple outputs 'star' or 'galaxy', many
of the refinements described in §2 have improved results, including probabilistic
outputs and bagging.



3.1.2. Galaxy Morphology 

As shown in Fig. 5, galaxies come in a range of different sizes and shapes, or
more collectively, morphology. The most well-known system for the morphological
classification of galaxies is the Hubble Sequence of elliptical, spiral, barred spiral,
and irregular, along with various subclasses. This system
correlates to many physical properties known to be important in the formation
and evolution of galaxies. Other well-known classification systems are the
Yerkes system based on concentration index, the de Vaucouleurs,
exponential, and Sérsic measures of the galaxy light profile,
the David Dunlap Observatory (DDO) system [145, 146, 147], and the
concentration-asymmetry-clumpiness (CAS) system.

Because galaxy morphology is a complex phenomenon that correlates to the
underlying physics, but is not unique to any one given process, the Hubble sequence
has endured, despite it being rather subjective and based on visible-light morphology
originally derived from blue-biased photographic plates. The Hubble sequence has
been extended in various ways, and for data mining purposes the T system
has been extensively used. This system maps the categorical Hubble types E, S0,
Sa, Sb, Sc, Sd, and Irr onto the numerical values -5 to 10.

One can, therefore, train a supervised algorithm to assign T types to images for
which measured parameters are available. Such parameters can be purely morphological,
or include other information such as color. A series of papers by Lahav and
collaborators do exactly this, by applying ANNs to predict
the T type of galaxies at low redshift, and finding equal accuracy to human experts.
ANNs have also been applied to higher redshift data to distinguish between
normal and peculiar galaxies, and the fundamentally topological and unsupervised
SOM ANN has been used to classify galaxies from Hubble Space Telescope
images, where the initial distribution of classes is not known. Likewise, ANNs
have been used to obtain morphological types from galaxy spectra.

Several authors study galaxy morphology at higher redshift by using the Hubble
Deep Fields, where the galaxies are generally much more distant, fainter, less
evolved, and morphologically peculiar. Three studies use ANNs trained
on surface brightness and light profiles to classify galaxies as E/S0, Sabc and Sd/Irr.
Another application uses Fourier decomposition on galaxy images followed by
ANNs to detect bars and assign T types.

Bazell & Aha use ensembles of classifiers, including ANN and DT, to reduce
the classification error, and Bazell studies the importance of various measured
input attributes, finding that no single measured parameter fully reproduces the
classifications. Ball et al. obtain similar results to Naim et al., but updated for
the SDSS. Ball et al. and Ball, Loveday & Brunner utilize these classifications
in studies of the bivariate luminosity function and the morphology-density relation
in the SDSS, the first such studies to utilize both a digital sky survey of this size
and detailed Hubble types.

Fig. 5. Examples of galaxy morphology showing many aspects of the information available to, and
issues to be aware of for, a data mining process. These include galaxy shape, structure, texture,
inclination, arm pitch, color, resolution, exposure, and, from left to right, redshift, in this case
artificially constructed. From Barden, Jahnke & Häußler.

Because of the complex nature of galaxy morphology and the plethora of available
approaches, a large number of further studies exist: Kelly & McKay (Fig. 6)
demonstrate improvement over a simple split in u - r using mixture models,
within a schema that incorporates morphology. Serra-Ricart et al. use an encoder
ANN to reduce the dimensionality of various data sets and perform several
applications, including morphology. Adams & Woolley use a committee of ANNs
in a 'waterfall' arrangement, in which the output from one ANN forms the input
to another which produces more detailed classes, improving their results. Molinari
& Smareglia use an SOM to identify E/S0 galaxies in clusters and measure their
luminosity function. de Theije & Katgert split E/S0 and spiral galaxies using
spectral principal components and study their kinematics in clusters. Genetic
algorithms have been employed for attribute selection and to evolve ANNs
to classify 'bent-double' galaxies in the radio survey data. Radio morphology
combines the compact nucleus of the radio galaxy and extremely long jets.
Thus, the bent-double morphology indicates the presence of a galaxy cluster. de la
Calleja & Fuentes combine ensembles of ANN and locally weighted regression.
Beyond ANN, Spiekermann uses fuzzy algebra and heuristic methods, anticipating
the importance of probabilistic studies (§4.1) that are just now beginning
to emerge. Owens, Griffiths & Ratnatunga use oblique DTs, obtaining similar
results to ANN. Zhang, Li & Zhao distinguish early and late types using k-means
clustering. SVMs have recently been employed on the COSMOS survey by
Huertas-Company et al., enabling early-late separation to $K_{AB}$ = 22 mag
twice as good as the CAS system. SVMs will also be used on data from the Gaia
satellite.

Recently, the popular Galaxy Zoo project has taken an alternative approach
to morphological classification, employing crowdsourcing: an application was made 
available online in which members of the general public were able to view images 
from the SDSS and assign classifications according to an outlined scheme. The 
project was very successful, and in a period of six months over 100,000 people 
provided over 40 million classifications for a sample of 893,212 galaxies, mostly 
to a limiting depth of r = 17.77 mag. The classifications included categories not 
previously assigned in astronomical data mining studies, such as edge-on or the 
handedness of spiral arms, and the project has produced multiple scientific results. 
The approach represents a complementary one to automated algorithms, because, 
although humans can see things an algorithm will miss and will be subject to dif- 
ferent systematic errors, the runtime is hugely longer: a trained ANN will produce 
the same 40 million classifications in a few minutes, rather than six months. 



3.1.3. Other Galaxy Classifications 

Many of the physical properties, and thus classification, of a galaxy are determined
by its stellar population. The spectrum of a galaxy is therefore another method for
classification, and can sometimes produce a clearer link to the underlying
physics than the morphology. Spectral classification is important because it is
possible for a range of morphological types to have the same spectral type, and vice
versa, because spectral types are driven by different underlying physical processes.

Fig. 6. Improvement in classification using a mixture model over that derived from the u and r
passbands (u - r color). In this case, the mixture model clearly delineates the third class, which is
not seen using u - r. The axes are the first two principal components of the spectro-morphological
parameter set (shapelet coefficients in five passbands) describing the galaxies. The light contours
are the square root of the probability density from the mixture model fit, and the dark contours
are the 95% threshold for each class, in the right-hand panel fitted to the two classes by quadratic
discriminant analysis. From Kelly & McKay.

Numerous studies have used PCA directly for spectral classification.
PCA is also often used as a preprocessing step before the classification of
spectral types using an ANN. Folkes, Lahav & Maddox predict morphological
types for the 2dF Galaxy Redshift Survey (2dFGRS) using spectra, and
Ball et al. directly predict spectral types in the SDSS using an ANN. Slonim
et al. use the information bottleneck approach on the 2dFGRS spectra, which
maximally preserves the spectral information for the desired number of classes. Lu
et al. use ensemble learning for ICA on components of galaxy spectra. Abdalla
et al. use ANN and locally weighted regression to directly predict emission line
properties from photometry.
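
As a minimal sketch of the PCA preprocessing step described above, the following projects a set of spectra onto their leading principal components using a singular value decomposition. The function name `pca_project` and the synthetic two-component "spectra" are illustrative assumptions, not taken from any of the cited studies; a real application would substitute rest-frame, normalized survey spectra.

```python
import numpy as np

def pca_project(spectra, n_components):
    """Project spectra (rows = objects, columns = wavelength bins)
    onto their leading principal components."""
    mean_spectrum = spectra.mean(axis=0)
    centered = spectra - mean_spectrum
    # SVD of the centered data: the rows of vt are the principal
    # components (eigenspectra), ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    coefficients = centered @ components.T
    explained = singular_values**2 / np.sum(singular_values**2)
    return coefficients, components, explained[:n_components]

# Synthetic stand-in for a spectral catalog (an assumption for illustration):
# 200 "spectra" built from two basis shapes plus a small amount of noise.
rng = np.random.default_rng(0)
wavelength = np.linspace(0.0, 1.0, 50)
basis = np.vstack([np.sin(2 * np.pi * wavelength), wavelength**2])
weights = rng.normal(size=(200, 2))
spectra = weights @ basis + 0.01 * rng.normal(size=(200, 50))

coefficients, components, explained = pca_project(spectra, n_components=2)
```

The low-dimensional `coefficients` array, rather than the full spectra, would then be passed to a classifier such as an ANN, which is the sense in which PCA acts as a preprocessing step.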

Bazell & Miller applied a semi-supervised method suitable for class discovery
using ANNs to the ESO-LV and SDSS Early Data Release (EDR) catalogs. They
found that a reduction of up to 57% in classification error was possible compared to
purely supervised ANNs. The larger of the two catalogs, the SDSS EDR, represents a
preliminary dataset about 6% the size of the final data release of the SDSS, clearly
indicating the as-yet untapped potential of this approach. The semi-supervised approach
also resembles the hybrid empirical-template approach to photometric redshifts (§3.2),
as both seek to utilize an existing training set where available even if it does not
span the whole parameter space. However, the approach used by Bazell & Miller
is more general, because it allows new classes of objects to be added, whereas the
hybrid approach can only iterate existing templates.

3.1.4. Quasars/AGN 

Most of the emitted electromagnetic radiation in the universe is either from stars, 
or the accretion disks surrounding supermassive black holes in active galactic nu- 
clei (AGN). The latter phenomenon is particularly dramatic in the case of quasars, 
where the light from the central region can outshine the rest of the galaxy. Be- 
cause supermassive black holes are thought to be fairly ubiquitous in large galaxies, 
and their fueling, and thus their intrinsic brightness, can be influenced by the en- 
vironment surrounding the host galaxy, quasars and other AGN are important for 
understanding the formation and evolution of structure in the universe. 

The selection of quasars and other AGN from an astronomical survey is a well-known
and important problem, and one well suited to a data mining approach. Different
wavebands (X-ray, optical, radio) select different AGN, and no one waveband can
select them all. Traditionally, AGN are classified on the Baldwin-Phillips-Terlevich
diagram, in which sources are plotted on the two-dimensional space of the emission
line ratios [O III] λ5007 / Hβ and [N II] / Hα, which is separated by a single curved
line into star-forming and AGN regions. Data mining improves on this not only by
allowing a more refined or higher dimensional separation, but also by including
passive objects in the same framework (Fig. 7). This allows for the probability that
an object contains an AGN to be calculated, and does not require all (or any) of the
emission lines to be detected.
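
For comparison with the data mining approaches, the traditional two-class cut can be sketched in a few lines. The text does not say which demarcation curve is meant; the sketch below assumes the widely used empirical curve of Kauffmann et al. (2003), log([O III]/Hβ) = 0.61 / (log([N II]/Hα) - 0.05) + 1.3, and the function name `bpt_class` is hypothetical.

```python
import math

def bpt_class(oiii_hbeta, nii_halpha):
    """Classify a source from the line flux ratios
    [O III] 5007 / H-beta and [N II] / H-alpha."""
    x = math.log10(nii_halpha)
    y = math.log10(oiii_hbeta)
    # Assumed demarcation: Kauffmann et al. (2003) curve, which is
    # only defined to the left of its asymptote at x = 0.05.
    if x >= 0.05:
        return "AGN"
    boundary = 0.61 / (x - 0.05) + 1.3
    return "AGN" if y > boundary else "star-forming"

print(bpt_class(0.3, 0.3))   # low-ionization source below the curve
print(bpt_class(10.0, 1.0))  # high-ionization source above the curve
```

Note that this function needs all four line fluxes and returns a hard label, which are exactly the two restrictions that the probabilistic, higher-dimensional classifiers described above are able to relax.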

Several groups have used ANNs or DTs
to select quasar candidates from surveys. White et al. show that the DT method
improves the reliability of the selection to 85% compared to only 60% for simpler
criteria. Other algorithms employed include PCA, SVM and learning vector
quantization, kd-trees, clustering in the form of principal surfaces and negative
entropy clustering, and kernel density estimation. Many of these papers
combine multiwavelength data, particularly X-ray, optical, and radio.

Similarly, one can select and classify candidates of all types of AGN. If
multiwavelength data are available, the characteristic ability of data mining
algorithms to form a model of the required complexity could enable them to use
the full information to extract more complete AGN samples. More
generally, one can classify both normal and active galaxies in one system,
differentiating between star formation and AGN. As one example, DTs have been used
to select quasar candidates in the SDSS, providing the probabilities P(star, galaxy,
quasar). P(star formation, AGN) could be supplied in a similar framework.
Bamford et al. combine mixture modeling and regression to perform non-parametric
mixture regression, and theirs is the first study to obtain such components and then
study them versus environment. The components are passive, star-forming, and two types
of AGN.

3.1.5. Other Classifications 

Often, the first component of classification is the actual process of object
detection, which is often done at some signal-to-noise threshold. Several statistical data
mining algorithms have been employed, and software packages written, for this
purpose, including the Faint Object Classification and Analysis System (FOCAS),
DAOPHOT, Source Extractor (SExtractor), maximum likelihood, wavelets,
mixture modeling, and ANN. Serra-Ricart et al. show that ANNs
are able to classify faint objects as well as a Bayesian classifier but with considerable
computational speedup.

Several studies are more general than star-galaxy separation or galaxy
classification, and assign classifications of varying detail to a broad range of
astrophysical objects. Goebel et al. apply the AutoClass Bayesian classifier to the
IRAS LRS atlas, finding new and scientifically interesting object classes. McGlynn
et al. use oblique DTs in a system called ClassX to classify X-ray objects into stars,
white dwarfs, X-ray binaries, galaxies, AGN, and clusters of galaxies, concluding that
the system has the potential to significantly increase the known populations of some
rare object types. Suchkov, Hanisch & Margon use the same system to classify
objects in the SDSS. Bazell, Miller & SubbaRao apply semi-supervised learning
to SDSS spectra, including those classified as 'unknown', finding two classes of
objects consisting of over 50% unknown.

Stellar classifications are necessarily either spectral or based on color, due to
the pointlike nature of the source. This field has a long history and well established
results such as the HR diagram and the OBAFGKM spectral sequence. The latter
is extended by the luminosity classes I-V to form the two-dimensional MK
classification system of Morgan, Keenan & Kellman [222]. Class I are supergiants,
through to class V, dwarfs, or main-sequence stars. The spectral types run from the
hottest and most massive stars, O, through to the coolest and least massive, M, and
each type is subdivided into ten subclasses 0-9. Thus, the MK classification of the
Sun is G2V.
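
The structure of a basic MK label can be made concrete with a small parser. This is only an illustrative sketch: the function `parse_mk` and its numeric "temperature index" are assumptions introduced here, and the parser handles only the core scheme described above (types O-M, subclasses 0-9, luminosity classes I-V), not the extensions mentioned later.

```python
import re

# Spectral sequence from hottest (O) to coolest (M), as given in the text.
SPECTRAL_SEQUENCE = "OBAFGKM"
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5}
MK_PATTERN = re.compile(r"^([OBAFGKM])([0-9])(III|II|IV|I|V)$")

def parse_mk(label):
    """Split a basic MK label such as 'G2V' into its components."""
    match = MK_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a basic MK label: {label!r}")
    letter, subclass, luminosity = match.groups()
    # Hypothetical ordinal: 0 for O0 (hottest), increasing to 69 for M9.
    temperature_index = 10 * SPECTRAL_SEQUENCE.index(letter) + int(subclass)
    return letter, int(subclass), ROMAN[luminosity], temperature_index

print(parse_mk("G2V"))  # the Sun
```

Encoding the two MK dimensions as ordinals in this way is one simple option for turning the labels into regression or classification targets for an automated algorithm.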

The use of automated algorithms to assign MK classes is analogous to that for 
assigning Hubble types to galaxies in several ways: before automated algorithms, 
stellar spectra were compared by eye to standard examples; the MK system is 
closely correlated to the underlying physics, but is ultimately based on observable 
quantities; the system works quite well but has been extended in numerous ways 
to incorporate objects that do not fit the main classes (e.g., L and T dwarfs, Wolf- 
Rayet stars, carbon stars, white dwarfs, and so on). Two differences from galaxy 
classification are the number of input parameters, in this case spectral indices, and 
the number of classes. In MK classification the numbers are generally higher, of 
order 50 or more input parameters, compared to of order 10 for galaxies. 

Given a large body of work for galaxies that has involved the use of artificial neural
networks, and the similarities just outlined, it is not surprising that similar
approaches have been employed for stellar classification, with
a typical accuracy of one spectral type and half a luminosity type. The relatively
large number of object attributes and output classes compared to the number of
objects in each class does not invalidate the approach, because the efforts described
generally find that the number of principal components represented by the inputs
is typically much lower. A well-known property of neural networks is that they are
robust to a large number of redundant attributes (§2.4.5).

August 11, 2010 0:44 WSPC/INSTRUCTION FILE ijmpd

28 N. M. Ball & R. J. Brunner

[Figure 7. Upper panel regions labeled "AGN dominated", "Seyfert", and "LINER"; horizontal axis log([N II] λ6583/Hα).]

Fig. 7. Upper panel: Baldwin-Phillips-Terlevich diagram, which classifies active galactic nuclei
(AGN) and star-forming galaxies but requires all four emission lines to be present in the spectrum.
From Bamford et al. (although it should be noted that the use of this diagram is not the basis
of their study). The axes are the diagnostic emission line ratios from the spectra. Lower panel:
AGN/star-forming/passive classification using an ANN, which has no such requirement. The axes
are the two outputs from the ANN, e_1 and e_2, mapped onto (e_1, e_2) = (e_1 + e_2/2) i + e_2 j, where
passive, AGN, star-forming, and hybrid are (0,0), (1,0), (0,1), and (0.5,0.5), respectively. From
Abdalla et al.

Neural networks have been used for other stellar classification schemes, e.g.,
Gupta et al. define 17 classes for IRAS sources, including planetary nebulae
and H II regions. Other methods have been employed; a recent example is Manteiga
et al., who use a fuzzy logic knowledge-based system with a hierarchical tree
of decision rules. Beyond the MK and other static classifications, variable stars
have been extensively studied for many years, e.g., Wozniak et al. use SVM to
distinguish Mira variables.

The detection and characterization of supernovae is important for both
understanding the astrophysics of these events, and their use as standard candles in
constraining aspects of cosmology such as the dark energy equation of state. Bailey
et al. [232] use boosted DTs, random forests, and SVMs to classify supernovae in
difference images, finding a ten times reduction in the false-positive rate compared
to standard techniques involving parameter thresholds (Fig. 8).

Given the general nature of the data mining approach, there are many further
classification examples, including cosmic ray hits, planetary nebulae,
asteroids, and gamma ray sources.



3.2. Photometric redshifts 

An area of astrophysics that has greatly increased in popularity in the last few years 
is the estimation of redshifts from photometric data (photo-zs). This is because, al- 
though the distances are less accurate than those obtained with spectra, the sheer 
number of objects with photometric measurements can often make up for the re- 
duction in individual accuracy by suppressing the statistical noise of an ensemble 
calculation. 
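
The trade-off can be made explicit with a short worked estimate. It assumes, beyond what is stated above, that the redshift errors of individual objects are independent and unbiased, so that the statistical error on an ensemble average falls as the inverse square root of the number of objects; the symbols σ_phot, σ_spec, N_phot, and N_spec are introduced here purely for illustration.

```latex
\sigma_{\bar{z}} = \frac{\sigma_z}{\sqrt{N}}
\quad\Longrightarrow\quad
\frac{\sigma_{\mathrm{phot}}}{\sqrt{N_{\mathrm{phot}}}}
  < \frac{\sigma_{\mathrm{spec}}}{\sqrt{N_{\mathrm{spec}}}}
\iff
\frac{N_{\mathrm{phot}}}{N_{\mathrm{spec}}}
  > \left( \frac{\sigma_{\mathrm{phot}}}{\sigma_{\mathrm{spec}}} \right)^{2}
```

Under these assumptions a photometric sample wins statistically once its size advantage exceeds the square of its per-object accuracy disadvantage. Systematic errors, such as the catastrophic failures discussed in §3.2.2, do not average down in this way.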

Photo-zs were first demonstrated in the mid 20th century, and later in
the 1980s. In the 1990s, the advent of the Hubble Space Telescope Deep Fields
resulted in numerous approaches [242, 243, 244, 245, 246, 247, 248], which have been
reviewed elsewhere.

In the past decade, the advent of wide-field CCD surveys and multifiber
spectroscopy has revolutionized the study of photo-zs to the point where they are
indispensable for the upcoming next generation surveys, and a large number of
studies have been made.

The two common approaches to photo-zs are the template method and the
empirical training set method. The template approach has many complicating
issues, including calibration, zero-points, priors, multiwavelength performance
(e.g., poor in the mid-infrared), and difficulty handling missing or incomplete
training data. We focus in this review on the empirical approach, as it is an
implementation of supervised learning. In the future, it is likely that a hybrid method
incorporating both templates and the empirical approach will be used, and that
the use of full probability density functions will become increasingly important.
For many applications, knowing the error distribution in the redshifts is at least
as important as the accuracy of the redshifts themselves, further motivating the
calculation of PDFs.

Fig. 8. Improvement in the classification of supernovae using support vector machine and decision
tree, compared to previously used threshold cuts. From Bailey et al.



3.2.1. Galaxies 

At low redshifts, the calculation of photometric redshifts for normal galaxies is
quite straightforward due to the break in the typical galaxy spectrum at 4000 Å.
Thus, as a galaxy is redshifted with increasing distance, the color (measured
as a difference in magnitudes) changes relatively smoothly. As a result, both
template and empirical photo-z approaches obtain similar results, a root-mean-square
deviation of ~0.02 in redshift, which is close to the best possible result
given the intrinsic spread in the properties. This has been shown with
ANNs [33, 165, 156, 252, 253, 254, 124, 255, 256, 257, 179], SVMs [258, 259], DTs [260],
kNN [261], empirical polynomial relations [262, 251, 247, 263, 264, 265], numerous
template-based studies, and several other methods. At higher redshifts, obtaining
accurate results becomes more difficult because the 4000 Å break is shifted redward of
the optical, galaxies are fainter and thus spectral data are sparser, and galaxies
intrinsically evolve over time. The first explorations at higher redshift were the
Hubble Deep Fields in the 1990s, described above (§3.2), and, more recently, new
infrared data have become available, which allow the 4000 Å break to be seen to higher
redshift, which improves the results. Template-based algorithms work well, provided
suitable templates into the infrared are available, and supervised algorithms simply
incorporate the new data and work in the same manner as previously described.

While supervised learning has been successfully used, beyond the spectral regime
the obvious limitation arises that in order to reach the limiting magnitude of the
photometric portions of surveys, extrapolation would be required. In this regime,
or where only small training sets are available, template-based results can be used,
but without spectral information, the templates themselves are being extrapolated.
However, the extrapolation of the templates is being done in a more physically
motivated manner. It is likely that the more general hybrid approach of using
empirical data to iteratively improve the templates, or the
semi-supervised method described in §2.4.3, will ultimately provide a more elegant
solution. Another issue at higher redshift is that the available numbers of objects
can become quite small (in the hundreds or fewer), thus reintroducing the curse of
dimensionality by a simple lack of objects compared to measured wavebands. The
methods of dimension reduction (§2.3) can help to mitigate this effect.

3.2.2. Quasars/AGN

Historically, the calculation of photometric redshifts for quasars and other AGN
has been even more difficult than for galaxies, because the spectra are dominated
by bright but narrow emission lines, which in broad photometric passbands can
dominate the color. The color-redshift relation of quasars is thus subject to several
effects, including degeneracy, one emission line appearing like another at a different
redshift, an emission line disappearing between survey filters, and reddening. In
addition, the filter sets of surveys are generally designed for normal galaxies and
not quasars. The assignment of these quasar photo-zs is thus a complex problem
that is amenable to data mining in a similar manner to the classification of AGN
described in §3.1.4.

The calculation of quasar photo-zs has had some success using SDSS
data, but they suffer from catastrophic failures, in which, as
shown in Fig. 9, the photometric redshift for a subset of the objects is completely
incorrect. However, data mining approaches have resulted in improvements to this
situation. Ball et al. find that a single-neighbor kNN gives a similar result to the
templates, but multiple neighbors, or other supervised algorithms such as DT or
ANN, pull in the regions of catastrophic failure and significantly decrease the spread
in the results. Kumar also shows this effect. Ball et al. go further and are
able to largely eliminate the catastrophics by selecting the subset of quasars with
one peak in their redshift probability density function (§4.1), a result confirmed by
Wolf. Wolf et al. also show significant improvement using the COMBO-17
survey, which has 17 filters compared to the five of the SDSS, but unfortunately the
photometric sample is much smaller.
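
The single-peak selection can be illustrated with a simple peak counter applied to a PDF tabulated on a redshift grid. This is a sketch under stated assumptions, not the criterion used in the cited work: the function `count_peaks`, the `min_height_fraction` threshold, and the Gaussian test PDFs are all hypothetical.

```python
import numpy as np

def count_peaks(pdf, min_height_fraction=0.1):
    """Count interior local maxima of a PDF sampled on a regular grid.

    A grid point counts as a peak if it is higher than its left
    neighbor, at least as high as its right neighbor, and at least
    min_height_fraction of the global maximum (to ignore noise).
    Maxima at the two end points of the grid are not counted.
    """
    pdf = np.asarray(pdf, dtype=float)
    threshold = min_height_fraction * pdf.max()
    interior = pdf[1:-1]
    is_peak = (interior > pdf[:-2]) & (interior >= pdf[2:]) & (interior >= threshold)
    return int(np.count_nonzero(is_peak))

# Illustrative PDFs: one well-constrained object and one with a
# secondary solution of the kind that produces catastrophic failures.
z = np.linspace(0.0, 5.0, 501)
single = np.exp(-0.5 * ((z - 2.0) / 0.1) ** 2)
double = single + 0.5 * np.exp(-0.5 * ((z - 0.5) / 0.1) ** 2)

keep_single = count_peaks(single) == 1
keep_double = count_peaks(double) == 1
```

Objects for which the count is exactly one would be retained; the price of the cleaner sample is that multi-peaked objects are discarded.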

Beyond the spectral regime, template-based results are sufficient, but again
suffer from catastrophics. Given our physical understanding of the nature of quasars,
it is in fact reasonable to extrapolate in magnitude when using colors as a training
set, because while one is going to fainter magnitudes, one is not extrapolating in
color. One could therefore quite reasonably assign empirical photo-zs for a full
photometric sample of quasars.



3.3. Other Astrophysical Applications 

Typically in data mining, information gathered from spectra has formed the
training set to apply a predictive technique to objects with photometry. However, it is
clear from this process that the spectrum itself contains a large amount of
information, and data mining techniques may be used directly on the spectra to extract
information that might otherwise remain hidden. Applications to galaxy spectral
classification were described in §3.1.3. In stellar work, besides the classification of
stars into the MK system based on observable parameters, several studies have
directly predicted physical parameters of stellar atmospheres using spectral indices.
One example is Ramírez, Fuentes & Gulati, who utilize a genetic algorithm to
select the appropriate input attributes, and predict the parameters using kNN. The
attribute selection reduces run time and improves predictive accuracy. Solorio et
al. use kNN to study stellar populations and improve the results by using active
learning to populate sparse regions of parameter space, an alternative to dimension
reduction.

Although it has much potential for the future (§4.2), the time domain is a field
in which a lot of work has already been done. Examples include the classification
of variable stars described in §3.1.5 and, in order of distance, the interaction of the
solar wind and the Earth's atmosphere, transient lunar phenomena, detection and
classification of asteroids and other solar system objects by composition and
orbit, solar system planetary atmospheres, stellar proper motions, extrasolar planets,
novae, stellar orbits around the supermassive black hole at the Galactic center,
microlensing from massive compact halo objects, supernovae, gamma ray bursts, and
quasar variability. A good overview is provided by Becker. The large potential
of the time domain for novel discovery lies within the as yet unexplored parameter
space defined by depth, sky coverage, and temporal resolution. One constraining
characteristic of the most variable sources beyond the solar system is that they are
generally point sources. As a result, the timescales of interest are constrained by
the light crossing time for the source.

Fig. 9. Photometric redshift, z_p, vs. spectroscopic redshift, z_s, for quasars in the Sloan Digital
Sky Survey, showing, in the upper panel, catastrophic failures in which z_p is very different from
z_s. Each individual point represents one quasar, and the contours indicate areas of high areal
point density. σ is the root-mean-square dispersion between z_p and z_s. The use of data mining
techniques, including assigning full probability density functions in photometric redshift, enables
the reduction or elimination of these catastrophics, as shown in the lower panel. Data based on
that from Ball et al.

The analysis of the cosmic microwave background (CMB) is amenable to several
techniques, including Bayesian modeling, wavelets, and ICA. The latter, in
particular via the FastICA algorithm, has been used in removal of CMB foregrounds
and cluster detection via the Sunyaev-Zeldovich effect. Phillips & Kogut use
a committee of ANNs for cosmological parameter estimation in CMB datasets, by
training them to identify parameter values in Monte Carlo simulations. This gives
unbiased parameter estimation in considerably less processing time than maximum
likelihood, but with comparable accuracy.

One can use the fact that objects cross-matched between surveys will likely have
correlated distributions in their measured attributes, for example, similar position
on the sky, to improve cross-matching results using pattern classifiers. Rohde et
al. combine distribution estimates and probabilistic classifiers to produce such
an improvement, and supply probabilistic outputs.

Taylor & Díaz obtain empirical fits for Galactic metallicity using ANNs,
whose architectures are evolved using genetic algorithms. This method is able to
provide equations for metallicity from line ratios, mitigating the 'black box' element
common to ANNs, and, in addition, is potentially able to identify new metallicity
diagnostics.

Bogdanos & Nesseris analyze Type Ia supernovae using genetic algorithms
to extract constraints on the dark energy equation of state. This method is
non-parametric, which minimizes bias from the necessarily a priori assumptions of
parametric models.

Lunar and planetary science, space science, and solar physics also provide many
examples of data mining uses. One example is Li et al., who demonstrate
improvements in solar flare forecasting resulting from the use of a mixture of experts,
in this case SVM and kNN. The analysis of the abundance of minerals or
constituents in soil samples using mixture models is another example of direct data
mining of spectra.

Finally, data mining can be performed on astronomical simulations, as well as
real datasets. Modern simulations can rival or even exceed real datasets in size and
complexity, and as such the data mining approach can be appropriate. An example
is the incorporation of theory into the Virtual Observatory (§4.5). Mining
simulation data will present extra challenges compared to observations because in
general there are fewer constraints on the type of data presented: observations
are of the same universe, but simulations are not; simulations can probe many
astrophysical processes that are not directly observable, such as stellar interiors; and
they provide direct physical quantities as well as observational ones. Most of the
largest simulations are cosmological, but they span many areas of astrophysics. A
prominent cosmological simulation is the Millennium Run, and over 200 papers
have utilized its data.

4. The Future

We now turn to the future of data mining in astronomy. Several trends are apparent
that indicate likely fruitful directions in the next few years. These trends can be
used to make informed decisions about upcoming, very large surveys. This section
assumes that the reader is somewhat familiar with the concepts in both §2 and §3,
namely, with both data mining and astronomy. We once again arrange the topics
by data mining algorithm rather than by astronomical application, but we now
interweave the algorithms with examples.

As in the past, it is likely that cross-fertilization with other fields will continue to
be beneficial to astronomy, and of particular relevance here, the data mining efforts
made by these fields. Examples include high energy physics, whose most obvious
spinoff is the World Wide Web from CERN, but the subject has an extensive history
of extremely large datasets from experiments such as particle colliders, and has
provided well-known and commonly used data analysis software such as ROOT,
designed to cope with these data sizes and first developed in 1994. In the fields
of biology and the geosciences, the concepts of informatics, the study of
computer-based information systems, have been extensively utilized, creating the subfields
of bio- and geoinformatics. The official recognition of an analogous subfield within
astronomy, astroinformatics, has recently been recommended.

4.1. Probability Density Functions 

A probability density function (PDF, Fig. 10) is a function, f(x), such that the
probability that the value, x, is in the interval a < x < b, is the definite integral
over the range:

$$P(a < x < b) = \int_a^b f(x)\,dx.$$

Thus the total area under the function is one. PDFs are of great significance for data
mining in astronomy because they retain information that is otherwise lost, and
because they enable results with improved signal-to-noise from a given dataset. One
can think of a PDF as a histogram in the limit of small bins but many objects.
Approaches such as supervised learning are in general taking as input the information
on objects and providing as output a prediction of properties. The most general way
to do this is to work with the full PDFs at each stage. The formalism has recently
been demonstrated in an astronomical context by Budavári, and it is applicable
to the prediction of any astronomical property. For inputs a, b, c, ..., the output
probabilities of a set of properties, P(x, y, z, ...), can be predicted. Fully probabilistic
cross-matching of surveys has also been implemented by the same author.
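
The definition and the histogram analogy can both be checked numerically. The sketch below is illustrative only: it assumes a unit Gaussian as the example PDF, and the helper names `trapezoid_area` and `interval_probability` are introduced here rather than taken from the cited formalism.

```python
import numpy as np

def trapezoid_area(x, y):
    """Trapezoidal-rule integral of tabulated values y(x)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def interval_probability(grid, density, a, b):
    """Approximate P(a < x < b) for a PDF tabulated on a grid."""
    inside = (grid >= a) & (grid <= b)
    return trapezoid_area(grid[inside], density[inside])

# Example PDF (an assumption for illustration): a unit Gaussian.
grid = np.linspace(-8.0, 8.0, 3201)
density = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)

total = trapezoid_area(grid, density)                     # total area
p_central = interval_probability(grid, density, -1.0, 1.0)

# A normalized histogram of many samples approximates the same PDF.
rng = np.random.default_rng(1)
samples = rng.normal(size=100_000)
heights, edges = np.histogram(samples, bins=200, density=True)
histogram_area = float(np.sum(heights * np.diff(edges)))
```

Here `total` and `histogram_area` are both unity to within numerical precision, and `p_central` recovers the familiar one-sigma probability of about 0.68.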



http://www.mpa-garching.mpg.de/millennium

Fig. 10. Example photometric redshift probability density functions (PDFs) for galaxies, showing
the rich content of extra information compared to a single value, or value plus Gaussian error. The
horizontal axes, z, are the photometric redshifts, and the vertical are the probability densities.
The solid red and dotted blue lines are the PDF with and without the photometric uncertainties,
respectively, and the vertical dashed green lines are, in these cases, the true distances. From
Budavári.



Results with PDFs in photo-zs are starting to appear, either with single values
and a spread, or the full PDF. Cunha et al. show that full PDFs help reduce
bias. Margoniner & Wittman show that they enable subsamples with improved
signal-to-noise, and Wittman also demonstrates reduction in error. Ball et al.
show that generating full photo-z PDFs for quasars allows selection of a sample
virtually free of catastrophic failures, the first time this has been demonstrated,
and an important result for their use as tracers of the large scale structure in the
universe. Wolf confirms a similar result. Myers, White & Ball show that using
the full PDF for clustering measurements will improve the signal-to-noise by four
to five times for a given dataset without any alteration of the data (Fig. 11). This
method is applicable to the clustering of any astronomical object. Full PDFs have
also been shown to improve performance in the photometric detection of galaxy
clusters, again due to the increased signal-to-noise ratio. Several further efforts
use a single photo-z and a spread, but not the full PDF. However, the method of
Myers, White & Ball shows that it is the full PDF that will give the most benefit.
PDFs will also be important for weak lensing.

As well as photo-zs, predicting properties naturally incorporates probabilistic
classification. Progress has been made, e.g., the SDSS has been classified
according to P(galaxy, star, neither). Similar classifications that could be made are
P(star formation, AGN) and P(quasar, not quasar). Bailer-Jones et al. implement
probabilistic classification that emphasizes finding very rare objects, in this
case quasars among the stars that will be seen by Gaia.

Ball et al. generate a PDF by perturbing inputs for a single-neighbor kNN.
The idea of perturbing data has been studied in the field of Privacy Preserving Data
Mining, but here the aim is to generate a PDF using the errors on the input
attributes in a way that is computationally scalable to upcoming datasets. The
approach appears to work well despite the fact that at present, survey photometric
errors are generally poorly characterized. Proper characterization of errors will
be of great importance to future surveys as the probabilistic approach becomes more
important. Scalability may be best implemented either by using kd-tree like data
structures, or by direct implementation on novel supercomputing hardware such as
FPGA, GPU, or Cell processors (§4.7), which can provide enormous performance
benefits for applications that require a large number of distance calculations.
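
A minimal sketch of the perturbation idea follows. It is not the implementation of Ball et al.: it assumes Gaussian errors on each input color, uses a brute-force nearest-neighbor search rather than a kd-tree or specialized hardware, and the function `perturbed_nn_pdf`, its parameters, and the toy training set are all hypothetical.

```python
import numpy as np

def perturbed_nn_pdf(train_colors, train_z, colors, color_errors,
                     n_perturb=100, bins=20, z_range=(0.0, 1.0), seed=0):
    """Photo-z PDF for one object from perturbed single-neighbor lookups."""
    rng = np.random.default_rng(seed)
    # Draw perturbed copies of the object's colors, assuming Gaussian
    # errors with the quoted widths (an assumption of this sketch).
    perturbed = rng.normal(loc=colors, scale=color_errors,
                           size=(n_perturb, len(colors)))
    # Brute-force squared Euclidean distance from every perturbed copy
    # to every training object; this is the distance-heavy step.
    diff = perturbed[:, None, :] - train_colors[None, :, :]
    nearest = np.argmin(np.sum(diff**2, axis=2), axis=1)
    # The redshifts of the nearest neighbors, histogrammed and
    # normalized to unit area, form the PDF.
    pdf, edges = np.histogram(train_z[nearest], bins=bins,
                              range=z_range, density=True)
    return pdf, edges

# Toy training set (illustrative): two colors that vary linearly with z.
train_z = np.linspace(0.0, 1.0, 200)
train_colors = np.column_stack([train_z, 2.0 * train_z])

pdf, edges = perturbed_nn_pdf(train_colors, train_z,
                              colors=np.array([0.5, 1.0]),
                              color_errors=np.array([0.02, 0.04]))
centers = 0.5 * (edges[:-1] + edges[1:])
```

The cost is dominated by `n_perturb` times the training set size in distance calculations per object, which is why tree structures or accelerator hardware become attractive at survey scale.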




Fig. 11. Improvement in the signal-to-noise ratio of the clustering signal of quasars enabled by
PDFs. The improvements to the projected correlation function (vertical axis) enabled by utilizing
PDFs are shown by the green crosses and red triangles, compared to the old method, based on
single-valued photometric redshifts, shown by blue diamonds. The horizontal axis is the projected
radial distance between objects. The diagonal lines are power-law fits, with scale length r_0, to the
correlation function. The points are offset for clarity. From Myers, White & Ball.




4.2. Real-Time Processing and the Time Domain 

The time domain is already a significant area of study and will become increasingly
important over the next decade with the advent of large scale synoptic surveys such
as the Large Synoptic Survey Telescope (LSST). A large number of temporally
resolved observations over large areas of the sky remains an unexplored area, and
the historical precedent suggests that many interesting phenomena remain to be
discovered.

However, as one might expect, this field presents a number of challenges not 
encountered in the data mining of static objects. These include (i) how to handle 
multiple observations of objects that can vary in irregular and unpredictable ways, 
both intrinsic and due to the observational equipment, (ii) objects in difference im- 
ages (the static background is subtracted, leaving the variation), (iii) the necessarily 
extremely rapid response to certain events such as gamma ray bursts where physical 
information can be lost mere seconds after an event becomes detectable, (iv) robust 
classification of large streams of data in real time, (v) lack of previous information 
on several phenomena, and (vi) the volume and storage of time domain information 
in databases. Other challenges are seen in static data, but will assume increased
importance as real-time accuracy is needed, for example, the removal of artifacts
that might otherwise be flagged as unusual objects and incur expensive follow-up
telescope time. Variability will be both photometric, a change in brightness, and
astrometric, because objects can move. While some astronomical phenomena, such
as certain types of variable stars, vary in a regular way, others vary in a nonlinear,
irregular, stochastic, or chaotic manner, and the variability itself can change
with time (heteroskedasticity). Time series analysis is a well developed area of
statistics, and many of these techniques will be useful.

The combination of available information, but incomplete coverage of the possible
phenomena, suggests that a probabilistic (§4.1) approach, either involving
priors, or semi-supervised (§2.4.3), will in general be the most appropriate. This is
because the algorithms can use the existing information, but objectively interpret
new phenomena. Supervised learning will perform better for problems where more
information and larger datasets are available, and unsupervised or Bayesian priors
will perform better when there are fewer observations. Many events will still require
follow-up observations, but since there will be far more events than can ever be
followed up in detail, data mining algorithms will help ensure that the observations
made are optimal in terms of the targeted scientific results.

As a confederation of data archives and interoperable standards of many of the
world's existing telescopes, the Virtual Observatory (VO, §4.5) will be crucial in
meeting the challenge of the time domain, and significant infrastructure for the
VO already exists. The VOEventNet is a system for the rapid handling of real-time
events, and provides an online federated data stream of events from several
telescopes. It can be followed by both human observers and robotic telescopes.

Numerous next-generation wide-field surveys in the planning or construction
stages will be synoptic. The largest such survey in the optical is the LSST, which
will observe the entire sky visible from its location every three nights. These
observations will provide a data stream exceeding one petabyte per year, and, as a
result, they anticipate many of the challenges described here. Like the LSST,
the Gaia satellite has working groups dedicated to data mining. The Classification
Working Group has employed several data mining techniques, and developed
new approaches to be used when the survey comes online. Other ongoing and
upcoming synoptic surveys include Palomar-Quest, the Catalina Real-Time
Transient Survey, Pan-STARRS, and those at other wavelengths, such as
instruments leading up to and including the Square Kilometer Array.

The time domain will not only provide challenges to existing methods of data
mining, but will open up new avenues for the extraction of information, such as using
the variability of objects for classification or photometric redshifts. Because
they are due to a relatively compact source in the center of galaxies, active galactic
nuclei vary on much shorter timescales than normal galaxies. This variability has
been proposed as a mechanism to select quasar and other AGN candidates. Other
events are suspected theoretically but have not been observed. But given the
dataset sizes, automated detection of such events at some level is clearly required.
The computational demands of real-time processing of the enormous data streams
from these surveys are significant, and will likely be met by the use of newly emerging
specialized computing hardware (§4.7).
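As a hedged illustration of variability-based selection, the sketch below flags objects whose light curves scatter more than their quoted errors can explain, using the reduced chi-square against a constant-brightness model. This is one simple statistic among many, not a method taken from the works cited here; the function names, the threshold of 3, and the toy light curves are assumptions. The inverse-variance weights allow for errors that differ from epoch to epoch, and the statistic does not require regular sampling.

```python
def reduced_chi2_constant(mags, errs):
    """Reduced chi-square of a light curve against a constant-brightness model.

    Inverse-variance weights mean that noisy epochs count for less.
    """
    weights = [1.0 / e ** 2 for e in errs]
    mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    chi2 = sum(w * (m - mean) ** 2 for w, m in zip(weights, mags))
    return chi2 / (len(mags) - 1)

def select_variable(light_curves, threshold=3.0):
    """Return the names of objects whose scatter exceeds what their errors explain."""
    return [name for name, (mags, errs) in light_curves.items()
            if len(mags) > 1 and reduced_chi2_constant(mags, errs) > threshold]

# Hypothetical light curves: (magnitudes, per-epoch errors).
light_curves = {
    "steady": ([18.00, 18.02, 17.99, 18.01], [0.02, 0.02, 0.02, 0.02]),
    "varying": ([18.0, 18.4, 17.7, 18.2], [0.02, 0.03, 0.02, 0.05]),
}
candidates = select_variable(light_curves)
```

In a real survey the threshold would need to be calibrated against the actual error model, since poorly characterized errors inflate or deflate the statistic for every object.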



4.3. Petascale Computing 

The current state of the art in supercomputing consists of terabyte-sized files and
teraflop computing speeds, which is conveniently encapsulated in the term terascale
computing. Following Moore's law, in which computer performance has increased
exponentially for the last several decades, the coming decade will feature
the similarly-derived petascale computing. Much of the performance increase in
the past decade has been driven by increases in processor (CPU) clock frequency,
but this rate has now slowed due to physical limitations on the sizes of components,
and more importantly power consumption and energy (heat) dissipation. It has
therefore become more economical to manufacture chips with multiple processor
cores.

The typical supercomputer today is a cluster, which consists of a large number
of conventional CPUs connected by a specialized interconnect system, with a distributed
or shared memory and a shared filesystem, and which hosts the Linux operating system.
Many systems are heterogeneous because this is scalable and cost-effective, but
coordinating such a system and making it effective can be challenging. In particular,
it will be vital that the system is properly balanced between processing power and
disk input/output (I/O) to supply the data. Combined with the increasing number
of processor cores, this means that parallel and distributed computing is rapidly 
increasing in importance. 
A useful set of 'rules of thumb' for parallel and other aspects of computing
remains true today. One of these is that roughly 50,000 CPU cycles are required
per byte of data. Most
scientific datasets require far fewer cycles than this, and it is thus likely that future
performance will be I/O limited, unless sufficient disks are provided in parallel.
Bell, Gray & Szalay estimate that a petascale system will require 100,000 one
TB disks. The exact details of how to distribute the data for best performance are
likely to be system-dependent. The available CPU speed should scale to the data
size, although it will not scale to most naively implemented data mining algorithms.
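The balance argument above amounts to simple arithmetic, sketched below in Python. The function names and the hardware figures in the usage lines are illustrative assumptions, not the specification of any real system. The first function gives the data rate at which the processors are exactly kept busy for a given cost in cycles per byte; the second compares that rate with the available I/O rate.

```python
def balanced_io_rate(n_cores, clock_hz, cycles_per_byte=50_000):
    """Data rate (bytes per second) at which the CPUs are exactly kept busy."""
    return n_cores * clock_hz / cycles_per_byte

def limiting_resource(n_cores, clock_hz, io_bytes_per_s, cycles_per_byte):
    """Report whether a job with the given cost per byte is CPU- or I/O-limited."""
    cpu_rate = balanced_io_rate(n_cores, clock_hz, cycles_per_byte)
    return "cpu" if cpu_rate < io_bytes_per_s else "io"

# Hypothetical machine: 200,000 cores at 4 GHz fed at 400 GB per second.
heavy = limiting_resource(200_000, 4e9, 4e11, cycles_per_byte=50_000)
light = limiting_resource(200_000, 4e9, 4e11, cycles_per_byte=100)
```

With these assumed numbers, a job at the full 50,000 cycles per byte is limited by the processors, whereas a job needing only 100 cycles per byte is limited by I/O, which is the regime the text anticipates for most scientific datasets.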



An example of an upcoming petascale system whose uses will include astronomical
data mining is the Blue Waters^d system at the National Center for Supercomputing
Applications (NCSA), which is due to come online in 2011. Specifications
include 200,000 compute cores with 4 GHz 8 core processors, 1 PB of main memory,
10 PB of user disk storage, 500 PB of archival storage, and 400 GB s^-1 bandwidth
connectivity to provide sustained petascale compute power. It will implement the
IBM PERCS (Productive, Easy-to-use, Reliable Computer System), which will
integrate their CPU, operating system, parallel programming, and file systems. This
provides a method of addressing the issues of running real-world applications at the
petascale by balancing the CPU, I/O, networking, and so on. Similarly, considerable
effort is being invested in the years leading up to deployment
in 2011 on the development of applications for the system, in consultation with the
scientists who will run them. Several astronomical applications are included, mostly
simulations, but also data mining in the form of the analysis of LSST datasets.

Not all petascale computing will be done on systems as large as Blue Waters.
In the US, the National Science Foundation Office of Cyberinfrastructure has been
advised to implement a power-law type system, with a small number of very large
systems, of order ten times more regional centers, and ten times more local facilities
(Tiers 1-3). Such local facilities, for example Beowulf clusters, are already common
in university departments, and typically consist of a few dozen commodity machines.
A recent trend matching the increasing requirements for data-intensive as opposed
to CPU-intensive computing is the GrayWulf cluster, which implements the idea
of data 'storage bricks': cheap, modular, and portable versions of a balanced system
which when added together provide petascale computation.

4.4. Parallel and Distributed Data Mining 

As indicated in §4.3 above, because of the slowing increase in raw speed of individual
CPUs, processors are becoming increasingly parallelized, both in terms of the number
of processor cores on a single chip, and increasing numbers of these chips being
deployed in parallel on supercomputing clusters. Providing appropriately scaled systems
(CPU, I/O, etc.) is one challenge, but most data mining algorithms not only
will be required to run on petascale data, but their naive implementations scale
as N^2, or worse. It has been suggested that any algorithm that scales beyond
N log N will rapidly be rendered infeasible.

^d http://www.ncsa.uiuc.edu/BlueWaters

McConnell and Skillicorn have promoted parallel and distributed data
mining, which is well-known in the data mining field, but virtually
unused in astronomy. In this approach, the algorithms explicitly take advantage of
available parallelism. The simplest example is task-farming, or the embarrassingly
parallel approach, in which a task is divided into many mutually-independent subtasks,
each of which is allocated to a single processor. This can be done on an array
of ordinary desktop machines as well as a supercomputer. A more complex challenge
is when many parts of the data must be accessed, or when an algorithm relies on the
outputs from calculations distributed across multiple compute nodes. For a large
dataset the hardware required likely includes shared memory (§4.3), thus shared
memory parallelization can be important. Many algorithms exist for the implementation
of data mining on parallel computer systems beyond simple task farming,
but these are not widely used within science, as compared to the commercial sector.
The application programming interfaces MPI and OpenMP have been widely used
on distributed and shared memory systems, respectively, for simulation and some
data analysis, but they do not offer the semantic capabilities needed for data
mining, i.e., the metadata describing the meaning of the data being processed and
the results produced are not easily incorporated.
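Task farming, the simplest case described above, can be sketched with the Python standard library. This is an illustrative pattern rather than code from any astronomical package; the worker function, the chunking scheme, and the toy catalog are assumptions. The data are cut into mutually independent chunks, each chunk is handed to a worker, and the partial results are gathered at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def count_bright(chunk, limit=20.0):
    """Subtask: count objects brighter (numerically smaller magnitude) than `limit`."""
    return sum(1 for mag in chunk if mag < limit)

def task_farm(data, worker, n_chunks=4, executor_cls=ThreadPoolExecutor):
    """Split `data` into independent chunks, farm them out, and gather the results."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with executor_cls(max_workers=n_chunks) as pool:
        # map() preserves chunk order, so results line up with the input.
        return list(pool.map(worker, chunks))

# Toy catalog of magnitudes from 15.00 to 24.99.
magnitudes = [15.0 + i / 100 for i in range(1000)]
partial_counts = task_farm(magnitudes, count_bright)
total = sum(partial_counts)
```

Threads are used here only to keep the sketch simple. In standard CPython, threads do not execute Python bytecode in parallel, so for CPU-bound subtasks one would typically pass `concurrent.futures.ProcessPoolExecutor` as `executor_cls`, or distribute the chunks across machines.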

Parallel data mining is challenging, as not only must the algorithm be imple- 
mented on the hardware, but many algorithms simply cannot be ported as-is to 
such a system. Instead, parallelization requires that the algorithm itself, as encap- 
sulated in the code, must often be fundamentally altered at the pseudocode level. 
This can be a time-consuming and counterintuitive process, especially to scientists 
who are generally not trained or experienced in parallel programming. Progress is
slowly being made in astronomy, including a parallel implementation of kd-trees,
cosmological simulations requiring datasets larger than the node memory size,
and parallelization of algorithms.

An alternative approach is grid computing, in which the exact resource used is 
unimportant to the user, although not all data mining algorithms lend themselves to 
this paradigm. A variant of grid computing is crowdsourcing, in which members of 
the public volunteer their spare CPU cycles to process data for a project. The most 
well-known project of this type is SETI@Home, and more recently, the Galaxy Zoo 
project, which employed large numbers of people to successfully classify galaxies in 
SDSS images. Such crowdsourcing is likely to become even more important in the
future, particularly in combination with greatly improved outreach via astronomical
applications on social networking sites such as Facebook.

Scalability is also helped on conventional CPUs by the employment of tree
structures, such as the kd-tree, which partition the data. This enables a search
to access any data value without searching the whole dataset. Kd-trees have been
used for many astronomical applications, including speeding up N-point correlation
functions, cross-matching, classification, and photometric redshifts. They can
be extended to more sophisticated structures, for example, the multi-tree. However,
implementation of such tree structures on parallel hardware or computational
accelerators (§4.7) remains difficult.
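To make the partitioning idea concrete, the following is a minimal kd-tree sketch in pure Python. It is a textbook construction, not the code used in any of the applications mentioned; the dictionary-based node layout and the function names are assumptions. The tree splits the points at the median along alternating axes, and the search skips any branch whose splitting plane lies farther away than the best match found so far.

```python
import random

def build_kdtree(points, depth=0):
    """Recursively split the points at the median along alternating axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Return the stored point closest to `query`, pruning hopeless branches."""
    if node is None:
        return best
    if best is None or sq_dist(query, node["point"]) < sq_dist(query, best):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # Descend the far side only if the splitting plane is closer than the best match.
    if diff ** 2 < sq_dist(query, best):
        best = nearest(far, query, best)
    return best

# Toy two-dimensional catalog.
rng = random.Random(1)
catalog = [(rng.random(), rng.random()) for _ in range(200)]
tree = build_kdtree(catalog)
match = nearest(tree, (0.5, 0.5))
```

The recursive, data-dependent branching in `nearest` is easy to express on a CPU but maps poorly onto hardware that favors many identical operations in lockstep, which is one way to see why porting such structures to accelerators is hard.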



4.5. The Virtual Observatory 

The Virtual Observatory (VO) is an analogous concept to a physical observatory, 
but instead of telescopes, various centers house data archives. The VO consists of 
numerous national-level organizations, and the International Virtual Observatory 
Alliance. Within the national organizations there are various data centers that house 
large datasets, computing facilities to process and analyze them, and people with 
considerable expertise in the datasets stored at that particular center. 

Common data standards and web services are necessary for the VO to work.
Such standards have emerged, including web services using XML and SOAP, a data
format, VOTable, a query language based on SQL, the Astronomical Data Query
Language, access protocols for images (SIAP) and spectra (SSAP),
VOEventNet for the time domain, plus various standards of interoperability
and ways of describing resources such as the Unified Content Descriptor. Large
numbers of high level tools for working with data are also available.
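As a small illustration of what such a standard looks like in practice, the sketch below reads a hand-written VOTable fragment with the Python standard library. The table contents, the column names, and the `read_votable` helper are hypothetical, and only the simplest TABLEDATA serialization with numeric columns is handled; production code would more likely rely on a dedicated VOTable library. Each FIELD element describes a column, including its unit and Unified Content Descriptor, and each TR element holds one row.

```python
import xml.etree.ElementTree as ET

# XML namespace of VOTable version 1.3.
NS = "{http://www.ivoa.net/xml/VOTable/v1.3}"

# Hypothetical two-row catalog written by hand for this example.
VOTABLE_XML = """<VOTABLE version="1.3" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
  <RESOURCE>
    <TABLE name="toy_catalog">
      <FIELD name="ra" datatype="double" unit="deg" ucd="pos.eq.ra"/>
      <FIELD name="dec" datatype="double" unit="deg" ucd="pos.eq.dec"/>
      <FIELD name="r_mag" datatype="float" unit="mag" ucd="phot.mag;em.opt.R"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>150.10</TD><TD>2.20</TD><TD>19.5</TD></TR>
          <TR><TD>150.25</TD><TD>2.31</TD><TD>21.0</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
"""

def read_votable(xml_text):
    """Return (column names, rows) where each row maps column name to a float."""
    root = ET.fromstring(xml_text)
    names = [field.get("name") for field in root.iter(NS + "FIELD")]
    rows = []
    for tr in root.iter(NS + "TR"):
        values = [float(td.text) for td in tr.iter(NS + "TD")]
        rows.append(dict(zip(names, values)))
    return names, rows

names, rows = read_votable(VOTABLE_XML)
```

Because the units and descriptors travel with the data, a tool that has never seen this catalog can still identify which columns hold positions and which hold magnitudes.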

An example of the emerging data standards for archiving is the Common
Archive Observation Model (CAOM) of the Canadian Astronomical Data Center
(CADC). Given that it is likely that the future VO will continue to consist of a
number of data centers like the CADC, this model represents a useful and realistic 
way in which data can be made meaningfully accessible, but not so rigidly presented 
as to prevent the desired analysis of future researchers with as yet unforeseen science 
goals. This model consists of the components Artifact, Plane, SimpleObservation, 
and CompositeObservation, which describe logical parts of the data from individ- 
ual files to logical sets of observations such as spectra, and forms the basis of all 
archiving activity at the CADC. 

The increasing immobility of large datasets, as described in §4.3, will render it
uneconomical in terms of time and money to download large datasets to local machines.
Rather than bringing the data to the analysis, it will become more sensible
to take the analysis to the data. To be able to perform complicated data mining
analyses, it is necessary that the data be organized well enough to make this
tractable, and that the center archiving the data have sufficient computing
power and web services to perform the analyses. The organizational requirement
means that the data must be stored as a database with the sophistication found in
the commercial sector, where mining of terascale databases is routine. Commercial
software and computer science expertise will help, but the task is non-trivial because
astronomical data analysis can require particular data types and structures not usually
found in commercial software, such as time series observations. An example of
such a database already in place is the SDSS, and its underlying schema has
been used and copied by other surveys such as GALEX.

^e http://www.ivoa.net/Documents
^f http://www.us-vo.org

Nevertheless, it is likely that considerable analyses will continue to be carried out 
on smaller subsets of the data, and this data may well continue to be downloaded 
and analyzed locally, as it has been to date. If one anticipates working exclusively 
with one survey, it may still be more efficient to implement a GrayWulf-like cluster 
locally and download the complete dataset. 

Another difficult problem faced by the VO is that a significant future scientific 
benefit from large datasets will be in the cross-matching of multiple datasets, in 
particular, multiwavelength data. But if such data are distributed among different 
data centers and are difficult to move, such work may be intractable. What can 
be done, however, is to make available as part of the VO web services, tools for 
cross-matching datasets at a given center. A common data format and description, 
combined with the fact that much of the science is done from small subsets of large 
datasets, means that this is certainly tractable. As a result, it is not surprising that 
there is significant demand for such tools.
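A positional cross-match of the kind such tools provide can be sketched in a few lines of Python. This brute-force version is for illustration only: the catalog layout, the one arcsecond radius, and the function names are assumptions, and a real service would use a tree or a spatial index rather than comparing every pair. For each source in the first catalog it returns the closest source in the second that lies within the match radius.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula; inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def cross_match(cat_a, cat_b, radius_arcsec=1.0):
    """Map each cat_a identifier to its closest cat_b source within the radius.

    Catalogs are lists of (identifier, ra_deg, dec_deg). Cost is O(N * M).
    """
    radius_deg = radius_arcsec / 3600.0
    matches = {}
    for id_a, ra_a, dec_a in cat_a:
        best = None
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            matches[id_a] = best[0]
    return matches

# Hypothetical optical and infrared catalogs.
optical = [("opt1", 150.1000, 2.2000), ("opt2", 150.2500, 2.3100)]
infrared = [("ir1", 150.1001, 2.2001), ("ir2", 151.0000, 3.0000)]
matched = cross_match(optical, infrared)
```

Only the identifiers and positions of the matched subset need to leave the data center, which is why a match run where the data reside can remain tractable even when the full catalogs cannot be moved.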

An important consideration for the VO is that many astronomers, indeed many 
scientists in general, will want to run their own software on the data, and not simply 
a higher level tool that involves trusting someone else's code. This will be true even 
if the source code is available. Or, a scientist might wish to complete an analysis 
that is not available in a higher level tool. It is thus important that the data are 
available at a low level of processing so that one can set one's own requirements as 
needed. NASA has a categorization of data where 0 is raw, 1 is calibrated, and 2 is a
derived product, such as a catalog. An ideal data archive would have available well
documented and accessible level 2 catalogs, similarly documented and accessible
level 1 data, and perhaps not online but stored level 0 data, to enable, for example,
a re-reduction.

Data have been released using the VO publishing interface, data mining
algorithms such as ANNs have been implemented, and applications for analyses
with web interfaces are online. Multiwavelength analyses are becoming more
feasible and useful, and it is therefore now possible, but still time-consuming, to
perform scientific analyses using VO tools. We expect this will be an area where
considerable work will still need to be done, however, in order to enable the
full exploitation of the archives of astronomy data in the future.



4.6. Visualization 



Visualization of data is an important part of the scientific process, and the combination
of terascale computing and data mining poses obvious challenges. Common
plotting codes presently in use in astronomy include SuperMongo^g, PGPlot^h, Gnuplot^i,
and IDL^j, but these are stand-alone codes that do not easily cope with
data that cannot be completely loaded into the available memory space. Newer tools,
such as TOPCAT, VisIVO, and VOMegaPlot support the Virtual Observatory
standards such as VOTable and PLASTIC for interoperability between
programs. The full library on which the TOPCAT program is based, STILTS,
is able to plot arbitrarily-sized datasets.

As with hardware, software, and data analysis, collaboration with computer 
scientists and other disciplines has resulted in progress in various areas of scientific 
visualization. At Harvard, the AstroMed project at the Initiative for Innovative
Computing (IIC) has collaborated with medical imaging teams. The rendering
of complex multi-dimensional volumetric and surficial data is a common desire of
both fields, and the medical imaging software was considerably more advanced than
was typical in astronomy in terms of graphical capability. As with the creation
and curation of databases for large datasets, collaboration with the IT sector has
enabled significant progress and the use of tools beyond the scope of those that
could be created by astronomers alone, such as Google Sky. It is likely that such
collaboration will continue to increase in importance.

The program S2Plot, developed at Swinburne, is motivated by the idea
of making three-dimensional plots as easy to transfer from one medium to another
(interchange) as two-dimensional plots. The existing familiar interface of a
plotting code, in this case PGPlot, has been extended to enable rendering of
multi-dimensional data on several media, including desktop machines, PDF files,
Powerpoint-style slides, or web pages. Systems in which the user is able to interact
directly with the data are also likely to play a significant role. Partiview, developed
at NCSA, enables the visualization of particulate data and some isosurfaces
either on a desktop or in an immersive CAVE system, and several astronomical
datasets have been visualized. Szalay, Springel & Lemson describe using graphical
processing units to aid visualization, in which the data are preprocessed to
hierarchical levels of detail, and only rendered to the resolution required to appear
to the eye as if the whole dataset is being rendered. Paraview^k is a program designed
for parallel processing on large datasets using distributed memory systems, or on
smaller data on a desktop.
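The idea of hierarchical levels of detail can be illustrated with a short Python sketch. It is a simplified stand-in, not the GPU method of Szalay, Springel & Lemson: the grid-based binning, the function names, and the random test points are assumptions. The points are binned once onto grids of increasing resolution, and at display time only the coarsest grid that still supplies at least one cell per screen pixel is drawn.

```python
import random

def build_levels(points, max_level=4):
    """Pre-compute two-dimensional counts on grids of 1x1, 2x2, 4x4, ... cells.

    Points are (x, y) pairs with 0 <= x, y < 1.
    """
    levels = []
    for level in range(max_level + 1):
        n = 2 ** level
        grid = [[0] * n for _ in range(n)]
        for x, y in points:
            grid[min(int(y * n), n - 1)][min(int(x * n), n - 1)] += 1
        levels.append(grid)
    return levels

def level_for_display(levels, pixels_across):
    """Pick the coarsest grid with at least one cell per display pixel.

    Falls back to the finest grid available if none is fine enough.
    """
    for grid in levels:
        if len(grid) >= pixels_across:
            return grid
    return levels[-1]

# Hypothetical point set standing in for a large catalog.
rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(1000)]
levels = build_levels(points)
overview = level_for_display(levels, pixels_across=3)
```

The preprocessing touches every point once per level, but each subsequent rendering reads only a grid whose size is set by the display, not by the dataset.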

Finally, in recent years, numerous online virtual worlds have become popular,
the most well-known of which is Second Life. Hut and Djorgovski^l describe
their interaction within these worlds, both with other astronomers in the form of
avatars in meetings, and with datasets. While it may initially seem to be just a
gimmicky way to have a meeting, the interaction with other avatars is described as
'fundamentally visceral', much more so than one would expect. This suggests that,
along with social networks for outreach, such interaction among astronomers may
become more common, as one will be able to attend a meeting without having to
travel physically.

^g http://www.astro.princeton.edu/~rhl/sm
^h http://www.astro.caltech.edu/~tjp/pgplot
^i http://www.gnuplot.info
^j http://idlastro.gsfc.nasa.gov
^k http://www.paraview.org
^l http://blogs.discovermagazine.com/cosmicvariance/2008/11/03/guest-post-george-djorgovski-a-new-world-overture

4.7. Novel Supercomputing Hardware 

For the final part of §4, we turn to novel supercomputing hardware. This is a rapidly
developing area, but it has enormous potential to speed up existing analyses, and
render previously impossible questions tractable. Specialized hardware has been
used in astronomy for many years, but until recently only in limited contexts and
applications, such as the systems designed specifically for n-body calculations,
or direct processing of data in instrument-specific hardware. Here, we
describe three hardware formats that have emerged in recent years as viable solu- 
tions to a more general range of astronomical problems: graphical processing units 
(GPUs), field-programmable gate arrays (FPGAs), and the Cell processor. 

As described in §4.4, the increasing speed of CPU clock cycles has now been
largely replaced by increasing parallelism as the main method for continuing im- 
provements in computing power. The methods described there implement coarse- 
grained parallelism, which is at the level of separate pieces of hardware or applica- 
tion processes. The hardware described here implements fine-grained parallelism, in 
which, at the instruction level, a calculation that would require multiple operations 
on a CPU is implemented in one operation. The hardware forms an intermediary 
between the previously-used application-specific integrated circuits (ASIC), and the 
general purpose CPU. 

Future petascale machines (§4.3) are likely to include some or all of these three,
either as highly integrated components in a cluster-type system, or as part of the 
heterogeneous hardware making up a distributed grid-like system that has overall 
petascale performance. 

Spurred by the computer gaming industry, the GPUs on graphics cards within 
desktop-scale computers have increased in performance much more rapidly than con- 
ventional processors (CPUs). They are specially designed to be very fast at carrying 
out huge numbers of operations that are used in the rendering of graphics, by using 
vector datatypes and streaming the data. Vector processors have been used before 
in supercomputing, but GPUs have become of great interest to the scientific com- 
munity due to their commodity-level pricing, which results from their widespread 
commercial use, and the increasing ease of use for more general operations than 
certain graphical processes. 

At first, GPUs dealt only with fixed-point numbers, but now single-precision
floating point and even double-precision are becoming more common. Thus the
chips are no longer simply specialized graphics engines, but are becoming much
more general-purpose (GPGPUs). Double-precision is required or highly desirable
for many scientific applications. The ease of use of GPUs has been increased
thanks to NVidia's Compute Unified Device Architecture development environment
(CUDA^m) for its cards, and will be further aided by the Open Computing Language
(OpenCL^n) for heterogeneous environments. These enable the GPU functions to be
called in a similar way to a C library, and are becoming a de facto standard. CUDA
has also been ported to other higher level languages, including PyCUDA in Python.

GPUs are beginning to be used in astronomy, and several applications have
appeared. GPUs can reproduce the functionality of the GRAPE hardware for n-body
simulations, and CUDA implementations have been shown to outperform
GRAPE in some circumstances. GPUs are beginning to be used for real-time
processing of data from next generation instruments as part of the Data Intensive
Science Consortium at the Harvard IIC. Significant speedup has been demonstrated
of a nearest neighbor search on a GPU compared to a kd-tree implemented in C
on a CPU.

FPGAs are another form of hardware that has become viable for somewhat
general-purpose scientific computing. While FPGAs have been widely used
as specialized hardware for many years, including in telescopes for data processing
or adaptive optics, it is only in the past few years that their speed, cost, capacity,
and ease of use have made them viable for more general use by non-specialists. As
with GPUs, the ability to work with full double precision floating point numbers
is also increasing, and their use is via libraries and development environments that
enable the FPGA portion of the code to appear as just another function call in C
or a C-like language. These tools implement the hardware description language to
program the FPGA, which need not be known by the user.

An FPGA consists of a grid of logic gates which must be programmed via soft- 
ware to implement a specific set of functions before running code (hence field- 
programmable). If the calculation to be performed can be fully represented in this 
way on the available gates, this enables a throughput speed of one whole calculation 
of a function per clock cycle, which given a modern FPGA's clock speed of 100 MHz 
or more, is 100 million per second. In practice, however, the actual speed is often 
limited by the I/O. 
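The interplay between clock rate and I/O can be written as a one-line estimate, sketched here in Python. The function name and the bandwidth and record size in the usage lines are assumptions chosen for illustration. A fully pipelined design completes one calculation per clock cycle, but only if the data arrive fast enough to feed it.

```python
def effective_throughput(clock_hz, io_bytes_per_s, bytes_per_item):
    """Items processed per second for a design that finishes one calculation
    per clock cycle, capped by the rate at which data can be delivered."""
    return min(clock_hz, io_bytes_per_s / bytes_per_item)

# Hypothetical case: 100 MHz clock, 40-byte records (five double-precision values).
slow_link = effective_throughput(100e6, 1e9, 40)    # fed at 1 GB per second
fast_link = effective_throughput(100e6, 1e10, 40)   # fed at 10 GB per second
```

Under these assumed figures, the slower link delivers only a quarter of the 100 million records per second that the logic could process, whereas the faster link lets the clock rate set the limit.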

One recent example is the direct mapping of an ANN onto an FPGA, which
can then in principle classify one object per clock cycle, or 100 million objects per
second at 100 MHz. FPGAs will continue to be widely used as specialized components
for astronomical systems, for example in providing real-time processing of the
next generation synoptic surveys. Brunner, Kindratenko & Myers demonstrated
a significant speedup of the N-point correlation function using FPGAs. Freeman,
Weeks & Austin directly implement distance calculations, such as required by
the kNN data mining algorithm, on an FPGA.

Finally, the IBM Cell processor is a chip containing a conventional CPU and
an array of eight more powerful coprocessors for hardware acceleration in a similar
manner to the GPU and FPGA. Like the NVidia GPU, it has been widely used in
mass-production machines such as the Playstation 3, and is or will be incorporated
into several 'hybrid' petascale machines, including IBM's Roadrunner, and possibly
Blue Waters. Unfortunately, also like the GPU, it is not yet as easy to use as desired
for large scale scientific use, but progress in the area is continuing.

^m http://www.nvidia.com/cuda
^n http://www.khronos.org/opencl

Further novel supercomputing hardware such as ClearSpeed may become viable
for science and widely used. It is an area of exciting developments and considerable 
potential. As with many new developments, however, one must be somewhat care- 
ful, in this case because the continued development of the hardware is driven by 
large commercial companies (NVidia, IBM, etc.), and not the scientific community. 
Nevertheless, the potential scientific gains are so large that it is certainly worth 
keeping an eye on. 

5. Conclusions 

In this review, we have introduced data mining in astronomy, given an overview of 
its implementation in the form of knowledge discovery in databases, reviewed its 
application to various science problems, and discussed its future. Throughout, we 
have tried to emphasize data mining as a tool to enable improved science, not as 
an end in itself, and to highlight areas where improvements have been made over 
previous analyses, where they might yet be made, and limitations of this approach. 

An astronomer is not a cutting-edge expert in data mining algorithms any more 
than they are in statistics, databases, hardware, software, etc., but they will need 
to know enough to usefully apply such approaches to the science problem they wish 
to address. It is likely that such progress will be made via collaboration with people 
who are experts in these areas, particularly within large projects, that will employ 
specialists and have working groups dedicated to data mining. Fully implemented, 
commercial-level databases will be required since the data will be too big to organize,
download, or analyze in any other way. 

The available infrastructure should, therefore, be designed so that this data 
mining approach to research is maximally enabled. The raw or minimally-processed 
data should be made available in a manner so one can apply user-specific codes either 
locally or using computational resources local to the data if data size necessitates it. 
It is unlikely that most researchers will either require or trust the exact resources 
made available by higher level tools. Instead, they will be useful for exploratory 
work, but ultimately one must be able to run personal or trusted code on the data, 
from the level of re-reduction upwards. 

A problem arises when one wishes to utilize multiple or distributed datasets, for
example in cross-matching data for multi-wavelength studies. Therefore, datasets
that can be easily made interoperable via a standard storage schema should be
made available. In this manner, a user can bring computing power and algorithms
to tackle their particular science question. This problem is particularly acute when
large datasets are held at widely separated sites, because transfer of such data across
the network is currently impractical. A great deal of science is done on small subsets
of the full data, so data will still be frequently downloaded and analyzed locally,
but the paradigm of downloading entire datasets is not sustainable.